NATIONAL BUREAU OF STANDARDS

microcopy resolution test chart

DIRECT AND ITERATIVE METHODS FOR BLOCK TRIDIAGONAL LINEAR SYSTEMS

Don Eric Heller

April 1977

DEPARTMENT

of

COMPUTER SCIENCE

D D C

?f7ar?mp|z

NOV 22 19TT

gSEDTTE

Carnegie -Mel Ion University

[ON STATEMENT A

proved for public release; Distribution Unlimited

ABSTRACT

Block tridiagonal systems of linear equations occur frequently in scientific computations, often forming the core of more complicated prob- lems. Numerical methods for solution of such systems are studied with emphasis on efficient methods for a vector computer. A convergence theory for direct methods under conditions of block diagonal dominance is developed, demonstrating stability, convergence and approximation properties of direct methods. Block elimination (LU factorization) is linear, cyclic odd-even reduction is quadratic, and higher-order methods exist. The odd-even methods are variations of the quadratic Newton iteration for the inverse matrix, and are the only quadratic methods within a certain reasonable class of algorithms. Serai-direct methods based on the quadratic conver- gence of odd-even reduction prove useful in combination with linear itera- tions for an approximate solution. An execution time analysis for a pipe- line computer is given, with attention to storage requirements and the effect of machine constraints on vector operations.

PREFACE

It is with many thanks that I acknowledge Professor J. F. Traub, for my introduction to numerical mathematics and parallel computation, and for his support and encouragement. Enlightening conversations with G. J. Fix, S. H. Fuller, H. T. Rung and D. K. Stevenson have contributed to some of the ideas presented in the thesis. A number of early results were obtained while in residence at ICASE , NASA Langley Research Center; J. M. Ortega and R. G. Voigt were most helpful in clarifying these results. The thesis was completed while at the Pennsylvania State University, and P. C. Fischer must be thanked for his forbearance.

Of course, the most important contribution has been that of my wife, Molly, who has put up with more than her fair share.

ii

CONTENTS

page

1. Introduction . 1

A. Summary of Main Results 5

B. Notation 8

2. Some Initial Remarks 10

A. Analytic Tools 10

B. Models of Parallel Computation 14

3 . Linear Methods 18

A. The LU Factorization 18

1. Block Elimination 18

2. Further Properties of Block Elimination 25

3. Back Substitution 30

4. Special Techniques 31

B. Gauss-Jordan Elimination 33

C. Parallel Computation 36

4. Quadratic Methods 37

A. Odd- Even Elimination 37

B. Variants of Odd-Even Elimination 45

C. Odd- Even Reduction 48

1. The Basic Algorithm 48

2. Relation to Block Elimination 51

3. Back Substitution 56

4. Storage Requirements 58

5. Special Techniques 59

D. Parallel Computation 60

E. Summary 64

5. Unification: Iteration and Fill-In 65

iii

\

6. Higher- Order Methods 71

A. p-Fold Reduction 71

B. Convergence Rates 75

C. Back Substitution 80

7. Semidirect Methods 83

A. Incomplete Elimination 84

B. Incomplete Reduction 86

C. Applicability of the Methods 90

8. Iterative Methods 92

A. Use of Elimination 92

B. Splittings 93

C. Multi- line Orderings 94

D. Spiral Orderings 97

E. Parallel Gauss 101

9. Applications 105

A. Curve Fitting 105

B. Finite Elements 106

10. Implementation of Algorithms 110

A. Choice of Algorithms (Generalities). 110

B. Storage Requirements 113

C. Comparative Timing Analysis 120

Appendix A. Vector Computer Instructions 131

Appendix B. Summary of Algorithms and Timings 136

References 156

iv

1. INTRODUCTION

Block tridiagonal matrices are a special class of matrices which arise in a variety of scientific and engineering computations, typically in the numerical solution of differential equations. For now it is sufficient to say that the matrix looks like

where b. is an n. X n. matrix and a., c. are dimensioned conformally. Thus 1 11 -N1 1 N

the full matrix A has dimension (, n.) x ( n.).

j-1 J ‘">1 J

In this thesis we study numerical methods for solution of a block tri- diagonal linear system Ax = v, with emphasis on efficient methods for a vector-oriented parallel computer (e.g., CDC STAR- 100, Illiac IV). Our analysis primarily concerns numerical properties of the algorithms, with discussion of their inherent parallelism and areas of application. For this reason many of our results also apply to standard sequential algo- rithms. A unifying analytic theme is that a wide variety of direct and iterative methods can be viewed as special cases of a general matrix itera- tion, and we make considerable use of a convergence theory for direct methods. Algorithms are compared using execution time estimates for a simplified model of the CDC STAR- 100 and recommendations are made on this

basis .

2

There are two important subclasses of block tridiagonal matrices, depending on whether the blocks are small and dense or large and sparse.

A computational method to solve the linear system Ax = v should take into account the internal structure of the blocks in order to obtain storage economy and a low operation count. Moreover, there are impor- tant applications in which an approximate solution is satisfactory.

Such considerations often motivate iterative methods for large sparse systems, especially when a direct method would be far more expensive.

Since 1965, however, there has been an increased interest in the direct solution of special systems derived from two-dimensional elliptic partial differential equations. With a standard five-point finite differ- ence approximation on a rectangular grid, the a^, c^ blocks are diagonal and the b^ blocks are tridiagonal, so A possesses a very regular sparse structure. By further specializing the differential equation more struc- ture will be available. Indeed, the seminal paper for many of the impor- tant developments dealt with the simplest possible elliptic equation. This was Hockney's work [H5], based on an original suggestion by Golub, on the use of Fourier transforms and the cyclic reduction algorithm for the solu- tion of Poisson's equation on a rectangle. For an n X n square grid Hockney

3

was able to solve Ax * v in 0(n ) arithmetic operations, while ordinary band-

4 3

matrix methods required 0(n ) operations. Other 0(n ) methods for Poisson's equation then extant had considerably larger asymptotic constants, and the new method proved to be significantly less time-consuming in practice. Sub- sequent discovery of the Fast Fourier Transform and Buneman's stable version

2

of cyclic reduction created fast (0(n log n) operations) and accurate

methods that attracted much attention from applications programmers and

r

numerical analysts [B20], [D2], [H6]. The Buneman algorithm has since been extended to Poisson1 s equation on certain nonrectangular regions and to general separable elliptic equations on a rectangle [S6], and well- tested Fortran subroutine packages are available (e.g. , [S8]). Other recent work on these problems includes the Fourier-Toeplitz methods fF2 ] , and Bank's generalized marching algorithms [B5]. The latter methods work by controlling a fast unstable method, and great care must be taken to maintain stability.

Despite the success of fast direct methods for specialized problems there is still no completely satisfactory direct method that applies to a general nonseparable elliptic equation. The best direct method available for this problem is George's nested dissection [Gl], which is theoreti- cally attractive but apparently hard to implement. Some interesting tech- niques to improve this situation are discussed by Eisenstat, Schultz and Sherman [El] and Rose and Whitten [R5] among others. Nevertheless, in present practice the nonseparable case is still frequently solved by an iteration, and excellent methods based on direct solution of the separable case have appeared (e.g., [C4], [C5]). The multi-level adaptive techniques recently analyzed and considerably developed by Brandt [B13] are a rather different approach, successfully combining iteration and a sequence of dis- cretization grids. Brandt discusses several methods suitable for parallel computation and although his work has not been considered here, it will undoubtedly play an important role in future studies. The state-of-the-art for solution of partial differential equations on vector computers is sum- marized by Ortega and Voigt [01].

References include [B19], [Dl], [S5], [S7], [S9].

4

1

With the application area of partial differential equations in mind, the major research goal of this thesis is a general analysis of direct block elimination methods, regardless of the sparsity of the blocks. As a practical measure, however, the direct methods are best restricted to cases where the blocks are small enough to be stored explicitly as full matrices, or special enough to be handled by some storage-efficient im- plicit representation. Proposed iterative methods for nonseparable ellip- tic equations will be based on the ability to solve such systems efficiently.

Our investigation of direct methods includes the block LU factoriza- tion, the cyclic reduction algorithm (called odd-even reduction here) and Sweet's generalized cyclic reduction algorithms [S10], along with varia- tions on these basic themes. The methods are most easily described as a sequence of matrix transformations applied to A in order to reduce it to block diagonal form. We show that, under conditions of block diagonal dominance, matrix norms describing the off-diagonal blocks relative to the diagonal blocks will not increase, decrease quadratically , or decrease at higher rates, for each basic algorithm, respectively. Thus the algo- rithms are naturally classified by their convergence properties. For the cyclic reduction algorithms, quadratic convergence leads to the defini- tion of useful semi-direct methods to approximate the solution of Ax = v.

Some of our results in the area have previously appeared in [H2].

All of the algorithms discussed here can be run, with varying degrees of efficiency, on a parallel computer that supports vector operations. By this we mean an array processor, such as Illiac IV, or a pipeline processor, such as the CDC STAR- 100 or the Texas Instruments ASC. These machines fall within the single instruction stream-multiple data stream classifica- tion of parallel computers, and have sufficiently many common features to

L

5

make a general comparison of algorithms pc:sible. However, there are also many special restrictions or capabilities that must be considered when designing programs for a particular machine. Our comparisons for a simplified model of STAR indicate how similar comparisons should be made for other machines. For a general survey of parallel methods in linear algebra plus background material, [H3] is recommended.

In almost every case, a solution method tailored to a particular problem on a particular computer can be expected to be superior to a general method naively applied to that problem and machine. However, the costs of tailoring can be quite high. Observations about the behav- ior of the general method can be used to guide the tailoring for a later solution, and we believe that the results presented here will allow us to move in that direction.

6

B[A] in the analysis of direct and semidirect methods, though it has long been used to analyze iterative methods.

Under the assumption j| B[A]||< 1, when a direct method generates a sequence of matrices with || B[Mi+1]|| s || B[M. ]||, we classify the method as linear, and if || B[M^+^]|| £ || B[M^]“|| or j| B [M_^ ] 1 1 we say that it is quadratic. The ‘“-norm is used throughout; extension to other norms would be desirable but proves quite difficult.

The block LU factorization and block Gauss-Jordan elimination are shown to be linear methods, and their properties are investigated in Sec- tion 3. Sample results are stability of the block factorization when || B [ A ] j j < 1, bounds on the condition numbers of the pivotal blocks, and convergence to zero of portions of the matrix in the Gauss-Jordan algorithm.

The methods of odd-even elimination and odd-even reduction, which are inherently parallel algorithms closely related to the fast Poisson solvers, are shown to be quadratic, and are discussed in Section 4. Odd-even elim- ination is seen to be a variation of the quadratic Newton iteration for A \ and odd-even reduction can be considered as a special case of Hageman and Varga's general cyclic reduction technique [HI], as equivalent to block elimination on a permuted system, and as a compact form of odd-even elimination. We also discuss certain aspects of the fast Poisson solvers as variations of the odd-even methods.

In Section 5 we show that the odd-even methods are the only quadratic methods within a certain reasonable class of algorithms. This class in- cludes all the previous methods as particular cases of a general matrix iteration. The relationship between block fill-in and quadratic conver- gence is also discussed.

L

7

In Section 6 the quadratic methods are shown to be special cases of families of higher-order methods, characterized by || B[M^+^]|| ^ j| BTM^ ] 1 1 P- We note that the p-fold reduction (Sweet's generalized reduction [S10]) is equivalent to a 2- fold reduction (odd-even) using a coarser partitioning of A.

Semidirect methods based on the quadratic properties of the odd-even algorithms are discussed in Section 7. Briefly, a semidirect method is a direct method that is stopped before completion, producing an approximate solution. We analyze the approximation error and its propagation vis-a-vis rounding error in the back substitution of odd-even reduction. Practical limitations of the semidirect methods are also considered.

In Section 8 we consider some iterative methods for systems arising from differential equations, emphasizing their utility for parallel com- puters. Fast Poisson solvers and general algorithms for systems with small blocks form the basis for iterations using the natural, multi-line and spiral orderings of a rectangular grid. We also remark on the use of elimination and the applicability of the Parallel Gauss iteration [T2],

[H4 ] .

A brief section on applications deals with two situations in which the assumption |(B[A]j|< 1 is not met, namely curve fitting and finite ele- ments, and in the latter case we describe a modification to the system Ax = v that will obtain ||b[A]|| < 1.

Methods for solution of block tridiagonal systems are compared in Sec- tion 10 using a simplified model of a pipeline (vector) computer. We con- clude that a combination of odd-even reduction and LU factorization is a powerful direct method, and that a combination of semidirect and iterative methods may provide limited savings if an approximate solution is desired.

8

A general discussion of storage requirements and the effect of machine constraints on vector operations is included; these are essential con- siderations for practical use of the algorithms.

l.B. Notation

Let the block tridiagonal matrix

Here b. is an n. X n. matrix and a. = 0, c = 0. The subscript N in the i l l I N

triplex notation will often be omitted for simplicity. Define

L[A] = (a., 0, 0) ,

D[ A ] = (0, b , 0),

U[A] - (0, 0, c ),

B[A] = I - (D[A])_1A = (-b"1 a , 0, -b"1 c ) ,

C[A] - I - A(D[A])” 1 = (-a b“}i, 0, -c jbj^) .

Use of these notations for general partitioned matrices should be evident; nonsingularity of D[A] is assumed in the definitions of B and C. B[A] is the matrix associated with the block Jacobi linear iteration for the solu- tion of Ax = v. It also assumes an important role in the study of certain

semidirect methods and will be central to our analysis of direct methods.

9

We further define = diag ( 0 , . . . , 0 , I, 0,...,0), where the I is

Oj X n^ and occurs in the jth block diagonal position. Pre- (post-) mul- tiplication by E. selects the jth block row (column) of a matrix, so that, J N N

E . AE . , D[A]E. = E.U[A], and E. L[A ] E . = 0.

. , l i j 1 t,. . i j

i= 1 J J j=i J

for example, D[A] =

For an nXn matrix M and an n-vector v, let

Pi(M)

y. (M)

II M|| =

II Ml'

n

"j-i

n

'i*l

m. .

1 1

|m. . = p . (M ) ,

J-J1 J

maxi * c^e co"nonn>

max ^ v (M)

M

v|| = max. |v.J , n

i=l

Vi'

|M

( I mi j I ) » and

0 (M) = the spectral radius of M.

The reader should take care not to confuse row sums (p^(M)) with the spec-

tral radius (o(M)). When A = (a., b., c .) , let \(A) = max . (4 ||b . ^ a . || c-

111 j J J J J

The condition \(A) £ 1 will be used as an alternative to |j B[A]|| < 1.

-1

2. SOME INITIAL REMARKS 2 . A . Analytic Tools

Much of the analysis presented in later sections involves the matrix B[A]. Why should we consider this matrix at all?

First, it is a reasonably invariant quantity which measures the diag- onal dominance of A. That is, if E is any nonsingular block diagonal matrix conforming with the partitioning of A, let A* 3 EA , so B* = B[A*] = I-D* *A* = I - (ED) (EA) = I - D ^A = B. Thus B[A] is invariant under block scal- ings of the equations Ax = v. On the other hand, B is not invariant under a change of variables y = Ex, which induces A' = AE We have B' = B[A' ] = EBE \ and for general A and E only the spectral radius 3(B) is invariant. Similarly, C[A] measures the diagonal dominance of A by columns, and is invariant under block diagonal change of variables but not under block scaling of the equations.

Secondly, one simple class of parallel semidirect methods (Section 7) can be created by transforming Ax = v into A*x = v*, A* = MA, v* = Mv for some matrix M, where || B[A*]|| « || B[A]||. This transformation typically represents the initial stages of some direct method. An approximate solu- tion y = D[A*] *v* is then computed with error x - y = B[A*]x, so ||B[A*]|| measures the relative error. Thus we study the behavior of B[A] under various matrix transformations.

Thirdly, B[A] is the matrix which determines the rate of convergence of the block Jacobi iteration for Ax = v. This is a natural parallel itera- tion, as are the faster semi- iterative methods based on it when A is posi- tive definite. It is known that various matrix transformations can affect

the convergence rates of an iteration, so we study this property.

11

Finally, B[A] and C[A] arise naturally in the analysis of block elim- ination methods to solve Ax = v. C represents multipliers used in elimina- tion, and B represents similar quantities used in back substitution. In order to place upper bounds on the size of these quantities we consider the effects of elimination on || B[A]||.

Since we will make frequent use of the condition jj B[A]|| < 1 we now discuss some conditions on A which will guarantee that it holds.

Lemma 2.1. Let J = + J? be order n, |j| = | J^j + | J? | , and suppose K

satisfies K = J^K + J9 . Then || j|| < 1 implies p^CK) £ P^J) for 1 £ i £ n, and so || K || £ j| j||.

Proof . From |j^| + j J0 1 = |j| we have p^J^) + P^C^) = Pj_(J) for 1 £ i £ n. From K = J^K + J we have p^(K) £ o (J^) || K | j + oi(J?). Suppose |j k|| s 1.

Then P^K) £ C ^ ( JT^) || k|| + P.^^) II Kll = II K|| s || J|| || K||> which implies

|| K || £ || J j| j| K jj and 1 £ || j||, a contradiction. Thus j| k|| < 1 and

Pl(K) £ O.^) + p^J.,) = Pt(J). QED.

We note that if o ^ ( J * 0 then we actually have p^K) < p^(J) and

if p.(J,) = 0 then p . (K) = o.(J„) = p,(J). It follows that if each row of l 1 iv t 2 i

contains a nonzero element then || K|| < || j||. These conditions are rem- iniscent of the Ste in-Rosenberg Theorem ([V3], [H8]) which states that if J is irreducible, and nonnegative, J9 * 0 and p(J^) < 1 then p(J) < 1 implies p (K) < p(J) and p(J) = 1 implies p(K) = o(J). There are also cor- respondences to the theory of regular splittings [V3].

Lemma 2.2. Suppose J = (I - S)K + T is order n, |j| = | (I - S) K | + |t|, P-^S) £ P^(T) , 1 £ i £ n. Then || j|| < 1 implies p^K) £ p^J) , 1 £ i £ n, and so f| K|| £ || j|j.

12

Proof. For 1 £ i £ n, p^T) £ p.(T) + p.((I - S)K) = P^J) £ || j|| < 1, so p.(T) < 1. Now, 1 > 0i(J) = p.((I - S) K) + Pt(T) S pt(K) - p.(S)||K||

+ p.(T) & P.(K) + P.(T)(1 - || K|| ). Suppose || K|| = p^(K) > 1. Then

I > pj(K) + p (T)(l - pj(K)), and we have 1 < p^(T), a contradiction.

Thus || K || ^ 1 and pi(J) 2 Pi(K). QED.

If S - T in Lemma 2.2 then K=»SK+ (J- S), |j| = | (I - S)K|+ |s|

= | J - S j + | S | , so by Lemma 2 . 1 we again conclude || K|| £ || j||.

Similar results hold for the 1-norm.

Lemma 2.3. Let J * J + J,, , |j| 3 |j | + | J0 | , and suppose K satisfies

K - KJX + J . Then || J 1^ < 1 implies || K||x ^ || j||x .

Proof. Since JT = + J^, |jT| => | J* | + | J* | , KT = J^KT + J, and

II Ml^ = || MT||, we have || JT|| = || < 1, and by Lemma 2.1, || - || KT||

* (I JT|( " (1 J|lr QED-

Lemma 2.4. Suppose J = K(I - S) + T, | j| = | K(T - S) | + |t| , v^S) £ y^T) , 1 * i £ n. Then || j||. < 1 implies || k|L * || j|| .

Proof. We have v^M) = P^fM ), = (I - ST)KT + TT, so by Lemma 2.2

|| K^|| £ || J^||, and the result follows. QED.

To investigate the condition || B[A]||< 1, we first note that if A is strictly diagonally dominant by rows then j| B[A]|| < 1 under any partition- ing of A. Indeed, if we take J to be the point Jacobi linear iteration matrix for A, J 3 I - diag(A) ^A, * D[J], Jj * J then

|| J1|| * || J|| < 1, |J| = | JL | + | J2| and B[A] - (I - so

|| B[A]|| * i| J|| < l by use of Lemma 2.1.

Now, if we denote the partitioning of A by the vector II = (n^ ,n,, , . . . ,n^) then the above remark may be generalized. Suppose H' = (m^ ,ra9 , . . . .m^) rep- resents another partitioning of A which refines II. That is, there are inte-

Pi_1

gers 1 = pQ < p^ < . . , < pN = M4-1 such that ni = m.. If B[A ] and

1 ~ J = P i- 1 J

B'[A] are defined by the two partitions and j| B'[A]|| < 1 then [| B [ A ] 1 1 £ || B'[A]||. This is an instance of the more general rule for iterative methods that "more implicitness means faster convergence" (cf. [H8, p. 107]) and will be useful in Section 6. Briefly, it shows that one simple way to decrease || B[A]|| is to use a coarser partition.

Let us consider a simple example to illustrate these ideas. Using

2

finite differences, the differential equation -u - u + Xu = f on T 0 , 1 ] ,

n xx yy l > j >

X > 0, with Dirichlet boundary conditions, gives rise to the matrix

9

A = (-1, M, -I), M = (-1, 4+Xh", -1). By Lemma 2.1, || B[A]|| £ || J[A]|| =

2

4/ (4+Xh ) < 1. Actually, since B[A] is easily determined we can give the better estimate || B[A]|| = 2 1 1 M ^ || < 2/ (2+Xh~) .

Indeed, many finite difference approximations to elliptic boundary value problems will produce matrices A such that || B[A]|| < 1 [ V3 ] ; |{ B[A]|| will frequently be quite close to 1. This condition need not hold for certain finite element approximations, so a different analysis will be needed in those cases. Varah [Vl], [V2] has examples of this type of analysis which are closely related to the original continuous problem. In Section 9.B we consider another approach based on a change of variable.

|| B[ A ] 1 1 < l is a weaker condition than strict block diagonal dominance relative to the “-norm, which for block tridiagonal matrices is the condi- tion ([FI], [Vl]) || b j 1 1| (|| a j || + || c j 1 1 ) < 1, 1 * j s u. The general definition of block diagonal dominance allows the use of different norms.

so for completeness we should also consider theorems based on this

14

assumption. However, in order to keep the discussion reasonably concise this will not always be done. It is sometimes stated that block diag- onal dominance is the right assumption to use when analyzing block elimi- nation, but we have found it more convenient and often more general to deal with the assumption || B[A]||< 1.

Other concepts related to diagonal dominance are G- and H-matrices .

If J = I - diag(A) ^A= L + U, where L (U) is strictly lower (upper) trian- gular, then A is strictly diagonally dominant if || j|| < 1, a G-matrix if || (I - j L | ) 1 |U| || < 1, and an H-matrix if p(|j|) < 1. Ostrowski [02] shows that A is an H-matrix iff there exists a nonsingular diagonal matrix E such that AE is strictly diagonally dominant. Varga [V4] summarizes recent results on H-matrices. Robert [R3] generalizes these definitions to block diagonally dominant, block G- and block H-matrices using vector- ial and matricial norms, and proves convergence of the classical iterative methods for Ax = v. Again, for simplicity we will not be completely gen- eral and will only consider some representative results.

2 .B . Models of Parallel Computation

In order to read Sections 3-9 only a very loose conception of a vector- oriented parallel computer is necessary. A more detailed description is needed for the execution time comparison of methods in Section 10. This subsection contains sufficient information for the initial part of the text, while additional details may be found in Appendix A, [H3], [S3], and the references given below.

Parallel computers fall into several general classes, of which we con- sider the vector computers, namely those operating in a synchronous manner

and capable of performing efficient floating point vector operations.

15

Examples are the array processor Illiac IV, with 64 parallel processing elements [B6], [B12]; the Cray-1, with eight 64-word vector registers [C8]; and the pipeline processors CDC STAR- 100 [C2] and Texas Instruments ASC [Wl], which can handle vectors of (essentially) arbitrary length. These machines have become important tools for large scientific computations, and there is much interest in algorithms able to make good use of special machine capabilities. Details of operation vary widely, but there are suf- ficiently many common features to make a general comparison of algorithms poss ible .

An important parallel computer not being considered here is the asyn- chronous Carnegie-Mellon University C.mmp [W4], with up to 16 minicomputer processors. Baudet [B7] discusses the theory of asynchronous iterations for linear and nonlinear systems of equations, plus results of tests on C . mmp .

Instructions for our model of a vector computer consist of the usual scalar operations and a special set of vector operations, some of which will be described shortly. We shall consider a conceptual vector to be simply an ordered set of storage locations, with no particular regard to the details of storage. A machine vector is a more specialized set of locations valid for use in a vector instruction, usually consisting of locations in arithmetic progression. The CDC STAR restricts the progres- sion to have increment = 1, which often forces programs to perform extra data manipulation in order to organize conceptual vectors into machine vectors .

We denote vector operations by a parenthesized list of indices. Thus

the vector sum w = x+y is

16

w. = x. + y . , (1 s i s N)

1 t i

, N

if the vectors are in R , the componentwise product would be

Wi = Xi X yi’ C1 s ^ N> »

and a matrix multiplication C = AB would be

Cij=_^1aik v (1 "N; 1 sn)-

Execution time for a vector operation depends on the length of the vector

and the internal mechanism of the vector computer, but can usually be

written in the form T N + c if the vector length is N. On a pipeline

op op r r

computer, 1/ t is the asymptotic result rate and a is the vector instruc- op op

tion startup cost. On a k-parallel computer (one with either k parallel

processors or k-word vector registers) the basic time would be t for k

op

operations in parallel and the vector instruction cost t N k~ + overhead.

op

In either case the tN + a model is adequate for vector operations; scalar

operation times are given as t

op

Simplified instruction execution times

for the CDC STAR- 100 are given below, in terms of the 40 nanosecond cycle

time. Usually T « t « a

op op op

operation

scalar time

vector

time

add, subtract

13

b*

71

multiply

17

N +

159

divide

46

2N +

167

square root

74

2N +

155

transmit

11

91

maximum

-

6N +

95

summation

-

6.5N

+ 122

inner product

-

6N +

130

17

The use 01 .ong vectors is essential for effective use of a vector computer, in order to take full advantage of increased result rates in the vector operations. The rather large vector startup costs create a severe penalty for operations on short vectors, and one must be careful to design algorithms which keep startup costs to a minimum. Scalar opera- tions can also be a source of inefficiency even when most of the code con- sists of vector operations. Data manipulation is often an essential con- sideration, especially when there are strong restrictions on machine vec- tors. These issues and others are illustrated in our comparison of algo- rithms in Section 10, which is based on information for the CDC STAR- 100.

18

3. LINEAR METHODS

The simplest approach to direct solution of a block tridiagonal linear system Ax = v is block elimination, which effectively computes an LU factor- ization of A. We discuss the process in terms of block elimination on a general partitioned matrix under the assumption j| B[A]|j< 1; the process is shown to be linear and stable. Various properties of block elimination are considered, such as pivoting requirements, bounds on the condition numbers of the pivotal blocks, error propagation in the back substitution, and special techniques for special systems. Gauss- Jordan elimination is studied for its numerical properties, which foreshadow our analysis of the more impor- tant odd-even elimination and reduction algorithms.

3 .A. The LU Factorization 3.A.I. Block Elimination

The block tridiagonal LU factorization

A = (aj,b.,c.)N = (i.,I,0)N(0,d.,c.)N

is computed by

d = b 1 1

,-l

dj = bj - ajdj-l cj-l> J = 2.....N,

l3 = Vj-l’ j =

where the existence of d^. ^ is assumed and will be demonstrated when || B [ A ] 1 1 < 1. For the moment we are not concerned with the particular methods used to compute or avoid computing either Z ^ or d^ ^ cj-l"

19

Let A^ = A and consider the matrix A0 which is obtained after the first block elimination step. We have A0 = (a.,3.,cJN> 3^ = d^ = b^, a ^ = 0,

32 = d2’ aj = aj and = bj for 3 s Let Bx = B[A1], B2 = B[A,].

Note that and differ only in the second block row; more specifically, in the (2,1) and (2,3) positions.

Theorem 3.1. If J| B ^ || < 1 then || B2 || £ || B || .

Proof. Let T = B^E^ and let S = TB^. We have d2 = b9(I - b0^a0b^c ),

II b2 la2bllcl II = H SH ^ II B1 II II Ei II II Bill = II Bil|2 < 1> so d21 exists and B„ is well-defined. Also, B^ = (I - S)B, + T, { B^ | = j (I - S) B, | + j T | by con- struction, and o^(S) £ 0 j (T) 1 1 B^|[ ^ P^(T). Thus Lemma 2.2 applies, yielding

QED

Now, for any partitioned matrix the N stages of block elimination (no block pivoting) can be described by the process

for i = 1 N- 1

Ai+1 - (I ' L[Ai]EiD[A1]‘1)Ai = (I + L[C[Ai]]Ei)A..

We transform A^ into by eliminating all the blocks of column i below the

diagonal; when A is block tridiagonal A^ is the upper block triangular matrix (0,dj,Cj)^. The same effect can be had by eliminating the blocks individually by the process

20

Aj = A

for i = 1, . . . ,N- 1

Af1! = A l+l l

for j = i+l,...,N

A^j) = (I - E.Af-^E.DCA^"^ ]'1)A^"1) l+l j i+l l l+l J l+l

L = (I + EjC[Ai+r ]Ei

i+l

L

A . . = a<N>

i+l i+l

Each minor stage affects only block row j of the matrix. It is seen that

block row j of A^~ ^ = block row j of A., i+l J l

block row i of Ap,1"* = block row i of A..

i+l l

from which it follows that

E.A^YofAy- Y1 = E.A.E.D[A.]-1, j i+l l i+l J j l l L iJ

so the major stages indeed generate the same sequence of matrices A^. Let B. = B[A.].

Corollary 3.1, If A is block tridiagonal and j| B^|| < 1 then || B.+1|| ^ |j B. ||, 1 £ i SN-1,

The proof of Corollary 3.1 requires a bit more care than the proof of Theorem 3.1 would seem to indicate; this difficulty will be resolved in Theorem 3.2. We should also note the trivial "error relations"

*i+i - 'Hill s ll*i - *JI. Bi+1- 'nII sliBi- BJ-

21

These follow from the fact that, for block tridiagonal matrices,

i N

A.

1

E .

A

+

~j=i+l

EjAr

, which holds regardless of the value of j| B ^|| .

Wilkinson [W3] has given a result for Gaussian elimination that is similar to Corollary 3.1, namely that if A is strictly diagonally dominant by columns then the same is true for all reduced matrices and || A^+^ (reduced) ||^ £

|| A^ (reduced) j|^. This actually applies to arbitrary matrices and not just the block tridiagonal case. Brenner [B14] anticipated the result but in a different context.

Our next two theorems, generalizing Corollary 3.1 and Wilkinson's observa- tion, lead to the characterization of block elimination as a norm- non- increas ing

k

projection method , and proves by bounding the growth factors that block elimination is stable when ))b[A]|| < 1. In addition, many special schemes for block tridiagonal matrices are equivalent to block elimination on a permuted matrix, so the result will be useful in some immediate applications.

Theorem 3.2. If )| B 1|| < 1 then || B.+1|| <; || B.|] and j| At+1|| ^ || A.|| for 1 £ i S8-1.

Proof. Suppose || B^||< 1. We have

A1+1 3 (I - L[A.]EiD[Aif1)Ai

- A. - L[A.]E. + L[A1]E1B1,

PJ(Ai+l>* Pj(Ai ' + Pj(L[Ai]EiB.)

s P j(Ai - L[A.]E.) + oj(L[Ai]Ei)|| B.||

k

The projections are onto spaces of matrices with zeros in appropriate positions.

* Pj(A. - L[A.]E.) + PjCLCA.JE.) = P.(A.),

II Ai+lH * II Ai II Now let

Q = D[Ai]‘1L[Ai]EiD[Ai]'1A.>

30 Ai+1 = Ai ' °fAi^» D[Ai+l] = D^AiKI - D[Q ])

Let S = D[ Q ] - Since

Q = (-L[B.]E.)(I - B.) = -L[B.] + L[B.]E.B.,

S = D[L[B.]E.B.] and ||s||* ||L[B.]|| ||e.|| ||B.||s ||B.j|2 < 1. Thus D[A^+^] exists and B^+^ is well-defined. It may be verified that (I ' = Bi - S + Q. Let T = Bl - S + Q, J = S + T = B + Q. Since

D[ Bjl ] = D[Q - S] = 0, we have D[T] = 0 and |j| = |s| + |t|. Thus, by Lemma 2.1, j| j|| < 1 implies || Bi+1|| ^ j| j|| . But we have, after some algebraic manipula- tion.

J = B. + Q

i

Ci EA

yN

+ / E.B, (I - E. + E.B.) ,

Tc-i+1 k 1 1 11

and it follows from this expression that || j|| £ || B^|| . QED

Although expressed in terms of the N- 1 major stages of block elimination, it is clear that we also have, for B^ = B[A^ ], || B^ || £ || B^"1^ || and

llAij)|| 5 || X) ||. Moreover, as long as blocks below the diagonal are eliminated such inequalities will hold regardless of the order of elimination.

23

We shall adopt the name "linear method" for any process which generates a sequence of matrices such that || B[M^+^][| s || B[M^]||, given that || B[M^]||< 1. Thus block elimination is a linear method.

A result that is dual to Theorem 3.2 is the following.

Definition. Let CL~ be the reduced matrix consisting of the last N-i+1 block rows and columns of A^. = A^ = A) .

Theorem 3.3. If || CfA]^ < 1 then || C[<2.+1 ] || l * || C[£2. ] ||r \\^i+l\\l ^ H^llj. and || L[C1]E.||1 * || CCA]^, C. = C[A.].

Remark. It is not true in general that ||c || £ | j c ^ 1 1 f HCilli< 1’

Proof. Let the blocks of C . be c , 1 £ ra, p £ N. Then C\&. 1 = (c ),-<s„ i mp L mp

mp

Let K = (cmp + craicip) i+l^n,p^

= ( r ) , , •+■

' mp i+ls<n,p^I

(ci,i+l CiN)-

Using Lemma 2 .4 , But

^i+iHl!

if : k L < l.

VK>*V E C6J? ])

4 H ~k=i+l

i+l»i\

'Ni

l^qCEf C82,])

+ II cBS1]||1vq(EiCK?i])

s v-C1+i E*

r

24

1

+ vq(Et CE^])

= vq(C52.]).

Thus II k||x ^ ||c[<2i]||1< 1 and || ct^i+1] II ± 5 || C^] || 1 . By construction

this implies || £ || C [ A ] 1 1 ^ for 1 £ i £ N. Now, let the blocks of

be a , 1 ^ m, p s n. Then CL. , = (a + c .a. and

mp l+l mp mi ip i+l^n,p^l

s V^+1 Etf>

llVq(Ei°l)

' X, «V

QED

We note that if A is symmetric then C[A] = B[A ] and Theorems 3.2 and 3.3 become nearly identical. Of course, B and C are always similar since C =* DBD When A is positive definite, Cline [ C 3 ] has shown that the eigenvalues of 42 interlace those of 42^ Actually, the proof depends on use of the Cholesky factorization, but the reduced matrices are the same in both methods. In consequence of Cline's result we have ||^Z || 5

The statements of Theorems 3.2 and 3.3 depend heavily on the particular

norms that were used. It is not true in general, for example, that if ||B[A1]||1< 1 then || B[A2] ||j £ || B[Aj] ||j or even || B[£, ] ^ s || B[4^ ] ||r This makes the extension to other norms quite difficult to determine.

25

1

3. A. 2. Further Properties of Block Elimination

It is well-known that the block factorization A - LAU is given by N

L - I - ; L[c ]E ,

S-i J J

A = D[A^1,

N

U = I - B =* I - / E. U[B ].

N “j-l J J

We have therefore shown in Theorems 3.2 and 3.3 that if || 1 then

II L||x * 1 + II C1|| J_ <2, || L|| * 1 + Nn|| C1||1, n = maXj n . , and if || B1|| < 1 then || U|| * 1 + || B x j| <2, || u|| l £ 1 + Nn|| bJI . (For block tridiagonal matrices the factor N can be removed, so |j L|| £ 1 + nj| C ||^, || £ 1 + n|| B^j|.)

Moreover, the factorization can be completed without permuting the block rows of A. The purpose of block partial pivoting for arbitrary matrices is to force the inequality || L[ C ^ ]E ^ 1 1 £ 1. Broyden [B15] contains a number of interesting results related to pivoting in Gaussian elimination. In particular it is shown that partial pivoting has the effect of bounding the condition number of L, though the bound and condition number can actually be quite large in the general case. We will have some more comments on pivoting after looking at some matrix properties that are invariant under block elimination (cf. [311]).

Corollary 3.2. Block elimination preserves strict diagonal dominance.

Proof . It follows from Theorem 3.2 that Gaussian elimination preserves strict diagonal dominance: simply partition A into lxl blocks. Now suppose we have A^ and are about to generate A This can be done either by one stage of

block elimination or by n^ stages of Gaussian elimination, for both processes produce the same result when using exact arithmetic [HS, p. 130]. Thus if A^ is strictly diagonally dominant, then so is QED

r

26

It is shown in [B4] that Gaussian elimination maps G-matrices into G-matrices. This can be extended in an obvious fashion to the case of block elimination. We also have

Corollary 3.3. Block elimination maps H-matrices into H-matrices.

Proof. Recall that A is an H-matrix if and only if there exists a nonsingular diagonal matrix E such that A' = AE is strictly diagonally dominant. Con- sider the sequence

A^ = A' ,

A^+1 = (I - L[A'.]E. D[A’]-1)A^.

Clearly A^ = AjE; suppose that A^ = A^E is strictly diagonally dominant. We then have

Ai+1 = (I " L[Ai]E EiE“1D[Aif1)A1E = (I - L[At]E. DrA.f^A.E

= Ai+1E’

so A^+^ is an H-matrix since A^_^ is strictly diagonally dominant by Corollary 3.2. QED

We have chosen to give this more concrete proof of Corollary 3.3 because it allows us to estimate another norm of B[A] when A is an H-matrix. Define I|m|!e= || E ^ME 1 1. Then since A' = AE implies B[A' ] = E ^B[A]E it follows that if A is an H-matrix then || BCAi+1]||E * || BCA^H . Since the matrix E will generally remain unknown this is in the nature of an existential proof.

The proof of Corollary 3.2 brings out an important interpretation of Theorem 3.2. If block elimination is actually performed then our result

gives direct information about properties of the reduced matrices. If not, and ordinary Gaussian elimination is used instead, then Theorem 3.2 pro- vides information about the process after each n^ steps. While we are not assured that pivoting will not be necessary, we do know that pivot selec- tions for these n. steps will be limited to elements within the d. block for the block tridiagonal LU factorization. When A has been partitioned to deal with mass storage hierarchies this will be important knowledge (see Section 10. A).

One more consequence of Theorem 3.2 is

Corollary 3.4. In the block tridiagonal LU factorization, if Jj B^|j < 1 then cond(dj) = || d || || d" 1 1| < 2 cond(b.) / (1 - || B1||2 ).

Proof. Letting a , = b . *a .d . \c . . , we have d. = b.(I - a.), ||a\|| £

J J J J-l J-l J J j 11 j"

II Bx|| || Bn|| * || Bjf < 1, so that || d || s || b || (1 + || o || ) < 2 || b ||, and

* b

(I- a.)"1!!* Hb-.1!!/ (1- Hcjll) ^ || b”1 1| / (1- Hb^I2).

It is important to have bounds on cond(d^) since systems involving d^

must be solved during the factorization. Of course, a posteriori bounds on

|| d.^|| may be computed at a moderate cost. We note also that

cond(A) ^ maXj 2 cond(b^) / (1 - || B^H ) when || B^|| < 1, and it is entirely

possible for the individual blocks b. and d. to be better-conditioned than A.

J j

The inequalities in Theorem 3.2 are the best possible under the condition || B^ll < 1. This is because, for example, the first block rows of B^ and B? are identical. However, it is possible to derive stronger results by using different assumptions about B.. .

V

28

i

Theorem 3 .4 . [H4 ] For A = (a^ ,b ,c ^) , let \(A) = max^ (4 || *a^ || ||b^ ^c ^ || ) ,

e j = b'V. If X(A) S 1 then || I - e j || £ (1 - Jw\)/2 and || 1 1| <: 2/(1 + v^) .

Some useful consequences of this theorem include better estimates for

II B., || and a smaller upper bound on the condition numbers of the d.'s. For we N J

have B^ = (0,0,-dj^cJ = (0,0, -e^b^ ^c^) , which implies that

|| B^jU ^ 2 || U[B^ ] || / (1 + v/1- X) . As an example , if A = (1,3,1) then Theorem

3.2 predicts only |) B^H ^ || B^|| = 2/3, while by Theorem 3.4 we obtain

|| Bj| £ 2(1/3) / (1 + ,/l-4/9) * 0.382. Also, || d“ 1 1| £ || d~ Lb ^ || ||b~ 1 1| =

l|a'j1|| l|b j 1 II S -||b j 1 ||» and ||d. || * |^- d . || + |^ || - INjd -^ || + |^ || ^

ll b j || || I - e ^ |j + || b. j| £ 3 || b^l/2. It follows that cond(dj) = || d" X|1 ||d j || s

3 cond(bj) . More precisely, cond(dj) ^ (1 + 2 u ( A.) ) cond(bJ , u * ( 1 - '1- \)/( 1 + »/l- X) .

This is further proof of the stability of the block LU factorization.

Theorem 3.4 has been used in the analysis of an iteration to compute the block diagonal matrix (0,d^,0)^ [ H4 ] . Such methods are potentially useful in the treatment of time- dependent problems, where good initial approximations are available. This particular iteration is designed for use on a parallel computer and has several remarkable properties; for details, see [H4] and Section 8.E.

Two results on the block LU factorization for block tridiagonal matrices were previously obtained by Varah.

Theorem 3.5. [ Vl ] If A is block diagonally dominant (|| b^||(||a^||+ | (c ^ 1 1) s 1) and || Cj|| J* 0, then || d~ 1 1| * || c^f1, || || s || || / || and

11^11*11^11+ II aj II*

29

In comparison to Theorem 3.2, the bound on || d^|| corresponds to the

bound || < w^ich follows from || B^U < 1, and the bound on || d^ ||

corresponds to the bound ||a^|| ^ ||a||. In contrast to Theorem 3.2, j| B || < 1

“1 "1 2

only allows us to conclude || d^ || 5 || b^ || / (1 - j| B^|| ), and we can only estimate || ^ || in the weak form || ijcj_1|| * || a. || || dj^c^JI 5 || « ||.

Varah's second result is

Theorem 3.6. [Vl] Let a. = (|| bj^ajll II ^j\Cj ill 2 £ j £ N, and sup-

pose a. P 0. Then the block LU factorization of A is numerically stable J

(i.e., there exists a constant K(A) such that max(|| l || , || d^ || ) £ K(A)) if (cty 1 , o' j is positive semidef inite.

We note that (<*^,1,3. is a symmetrization of ( 1 1 b . ^a ^ ||, 1 , | j b ^ ^c ^ 1 1 ) and that \(A) £ 1 implies that it is positive definite. The actual bounds on || 2^ ||, || d j 1 1 given by Varah [VI] for this case will not be important for

this discussion.

Since many block tridiagonal systems arise from the discretization of differential equations, a number of results have been obtained under assump- tions closely related to the original continuous problem. Rosser [R7] shows that if the system Ax = v, A = (aj,bj,Cj), satisfies a certain maximum principle then A is nonsingular, the LU factorization exists (each d^ is non- singular) , B^ is non-negative and ||b^ || S 1. Improved bounds on B^ are obtained when A is defined by the nine-point difference approximation to Poisson's equation. In particular there is the remarkable observation that the spectral condition number of d^ is less than 17/3. Such results depend strongly on the particular matrix under consideration. For examples of related work, see [B18], [Cl], [D2].

30

3. A. 3. Back Substitution

So far we have only considered the elimination phase of the solution of Ax = v. For block tridiagonal matrices A = (a^.b^.c^^ this consists of the process

di = V fi = v

d. = b - a.d.^.c. f. = v. - a.d.^.f.

J j J 3-1 3-1 3 3 3 3-1 j-1*

3 = 2 , . . ,N.

The back substitution phase consists of the process

XN = dN fN’

X3 = dj1(f3 " CjXJ+l)* j =

In matrix terms these are the processes

Al - A, f(1) = v,

A.+1 - (I + L[C.]E.)A. , i =* 1,...,N-1,

f(L+1) = (I + L[C.]E.)f(L) , i = 1,...,N-1,

(N) „r» n-lr(N>

g = x D[A^] fv ,

x(l) g + EjB x(l+1), i =* N- 1 1 ,

X 53 X

(1)

An alternate form of back substitution is

x(N) - D[AN]’1f(N)

.(1)

X 3 X

(I + BNE. + 1)x(i+1), i = N- 1 , . . . , 1 ,

(1)

31

In either case it is seen that plays an essential role in studying the back-substitution process.

For those situations where a band elimination method for Ax = v might be

preferred, the matrix maintains its importance for back substitution. We

have A = (a.,b.,c.) = (a . , l . ,0) (0,u . ,c .) , d. = i.u., a. = a .u . _, c. = l, c., JJJ J J JJJ JJJ J J-l J JJ

A A

where .^(u^) is lower (upper) triangular. The elimination phase generates

* * 1 * - 1

vectors f. = 2.f., and the back substitution computes x^ = u^ f^ = f ,

A-l * A . -1

x. = u. (f. - c.x.,,) = d. (f. - c.x.,..). Except for the specific form of J J V J J J+l J J J J+l

roundoff errors the band and block back substitutions must behave identically. Let us briefly consider the effect of rounding errors in back substitu-

tion; the computed solution will be y . = x . + * and we require bounds on f . Suppose the computed forward elimination yields d^ = d^ + 5 , f^ = f^ -4- cp ^ .

JU

Let 7). represent the errors introduced in forming f. - c.y. ,, so that y.

J J J J+l J

* (2) *

satisfies (d . + 6. )y. = f. - c.y. , + T|. if ordinary Gaussian elimination

J J J J J J+l J

*

is used to solve for y. in d. y. = f. - c.y.,, + T|.. It follows that

J J 'J J J J+l J

v. = d.^(f. - c.y.,,) + e., where e. = d.^(®. + 1. - (6^ + 6^ )y.), which J J J J J+l j j J J j J j j

implies = - djlcj?j+l- Thus H Sjll 5 II BNH II ?j+lH+ II ejH» and the

previous errors tend to be damped out in later computations if ||b^||< 1. If

we have a uniform upper bound for the e^'s, || || £ e, and II Bj,|| < 1, then

we

obtain J! Sj|| 5 (N - j + 1) e> || J] * d - || j+1) e/ d - \^\\> < •/ (1- |)bn|J)

3. A. 4. Special Techniques

In some cases the matrix A = (a^b^c^) Possesses special properties that may be important to preserve in order to satisfy a strong stability criterion. For example, we might have b

j

t . - a , - c . where J j J

t j || is small

or zero; this requires that all blocks be n X n. Scalar tridiagonal systems

J

r

32

with this property arise from the discretization of second order ODE boundary value problems. Babuska [Bl-3] has discussed a combined LU - UL factoriza- tion for such a matrix which allows a very effective backward error analysis to the ODE and a stable numerical solution of a difficult singular pertur- bation problem posed by Dorr [ D3 ] . (See [H7] for a symbolic approach to this problem.)

There are two essential features to Babuska' s method. The first is to express b^ and d^ indirectly by starting with t ^ and computing the se- quence s^ = t^, Sj = tj - ]_• By this we have d = s. - c^. The

second feature is to augment the forward elimination with a backward elim- ination and dispense with the back substitution by combining the results of elimination. Specifically, suppose A = (— p j , t ^ + p^. + q^.-q^)^.

Pj.q^.tj £ 0, p^ = q^ = 0, A nonsingular. The algorithm for Ax = v is

V fl

V SN

V fN = V

-1

Sj = Cj + Pj(Sj-l+ qj-l} SM

£j ' vj + pj<sj-i +

_ \ s, = t + q .(s , , . + p^.) s

j

*j 'j+l rj+i “j+i

f. v + q . (s . . + p. .

J j V J+l KJ+1 J+l

j 2 , . . . ,N,

j - N-

(s. + s* - tj)_1(fj + f * - v.), j = 1, . . . ,N.

In matrix terms, we factor A = A^ + AQ + A^ = (I + L) (D + Ay) (I + U ) (D + A^) and solve (I + L) f v, (I + U ) f * v. Adding (D + A^)x = f, (D + A^)x = f

"k

and -Aqx -Aqx we obtain (D + D - Ap)x = f + f - Ax = f + f -v.

J

33

Babuska's a^^orithm has some unusual round-off properties (cf. [M3]) and underflow is a problem though correctable by scaling. It handles small row sums by taking account of dependent error distributions in the data, and thus fits into Kahan's notion that "conserving confluence curbs ill- condition" [Kl].

To see how the method fits in with our results, we note that, for n = 1,

since p.,q.,t i 0 we have s £ 0, s = t. + p.d'^s. , £ t.. Thus the row J J J 1 J J J J-l J-l J

sums can only increase during the factorization. On the other hand, they

will not increase much since s.“t.+p.(s,,+q, ,)^s . £ t. + p..

J J J J-l J-l j-l J KJ

These are analogues of the results in Theorem 3.2.

3 . B. Gauss-Jordan Elimination

An algorithm closely related to block elimination is the block Gauss- Jordan elimination

Aj = A,

Ai+i - (I + C[A.1E.)A., i = 1, . . . ,N.

This avoids back substitution in the solution of Ax = v since A^_^ is block diagonal. The rest of the Gauss-Jordan process is

-f(D

f (1+1) - (i +

C[A. ]E.)fa)

1 )

Theorem 3.7. Let B. - B[A,]. If || B^l < 1 then ||A.+1|| £ || A. || and

II ®i+ill s II II for L =

34

Sketch of Proof. The proof parallels that of Theorem 3.2. Note = B^; suppose || B^ || < 1. Let F^ = L[A^] + UfA^], so that A^+^ = (I - F.E.D[A ]”*)A.. = A. - F.E. + F.E.B.. This will show the first part. Now define

i ii ill.

Q = D[Ai]‘1F.E.D[A.]'1Ai, so that A.+1 = A. - D[A.]Q, D[A.+1] = D[A.](I - S) ,

S = D[Q ] . We have Q = (- B.E.)(I - B.) = - B.E. + B.E.B., so S = D[B.E.B.]

1 i l ii ill ill

and || S || £ || B.J| < 1. Also, (I - S)B.+1 =■ B. - S + Q. Let T = L - S + Q,

J = S + T, so that | J | = j S | + | T | , and Lemma 2 . 1 applies if || J j| < 1. But J = B^(I - E^ + E^B^) , so || J 1 1 ^ || B ^ 1 1 . From Lemma 2.1 we conclude

QED

When applied to block tridiagonal matrices the block Gauss-Jordan algo- rithm requires many more arithmetic operations than the block LU factoriza- tion, and cannot be recommended on this basis. However, it has some numer- ical properties that foreshadow our analysis of the more important odd-even reduction algorithm.

The Gauss-Jordan elimination on block tridiagonals may be expressed in the following way:

for j = 1, . . . ,N- 1

r : 1

We observe that, for 1 S i <j+k SN, -d~ 1e . , . =• u .u u , , , ,

J J,j+k J J+j. j+k-T

which is the (j,j 4- k) block of B^. Thus we are able to refine Theorem 3.7 by giving closer bounds on the first i-1 block rows of B..

l

Corollary 3.5. If ||b^|| <1 and (point) row L is contained in block row j with 1 ^ j S i-1, then

If (point) row l is contained in block row j with i £ j £ N, then

VV = W * II Bill

This is our first result on the convergence to zero of portions of a matrix during the reduction to a block diagonal matrix. It also provides a rather different situation from Gauss-Jordan elimination with partial pivoting applied to an arbitrary matrix [PI]. In that case the multipliers above the main diagonal can be quite large, while we have seen for block

36

tridiagonal matrices with j| B^|| < 1 that the multipliers actually decrease in size.

3 .C . Parallel Computation

We close this section with some remarks on the applications and inher- ent parallelism of the methods; see also [H3]. When a small number of parallel processors are available the block and band LU factorizations are natural approaches. This is because the solution (or reduction to tri- angular form) of the block system d^u. = -c . is naturally expressed in terms of vector operations on rows of the augmented matrix (d^|-Cj), and these vector operations can be executed efficiently in parallel. However, we often cannot make efficient use of a pipeline processor since the vec- tor lengths might not be long enough to overcome the vector startup costs. Our comparison of algorithms for a pipeline processor (Section 10. C) will therefore consider the LU factorization executed in scalar mode. We also note that the final stage of Babuska's LU-UL factorization is the solution of a block diagonal system, which is entirely parallel.

On the other hand, the storage problems created by fill-in with these methods when the blocks are originally sparse are well-known. For large systems derived from elliptic equations this is often enough to disqualify general elimination methods as competitive algorithms. The block elimina- tion methods will be useful when the block sizes are small enough so that they can be represented directly as full matrices, or when a stable implicit representation can be used instead.

37

4. QUADRATIC METHODS

In Section 3 we have outlined some properties of the block LU factor- ization and block Gauss-Jordan elimination. The properties are character- ized by the generation of a sequence of matrices >L such that || B[Mi+^]|| ^ || B[1L]||. From this inequality we adopted the name "linear methods". Now we consider some quadratic methods, which are characterized by the inequality || B[Mi+1]|| £ || B[Mi ]2 || .

There are two basic quadratic methods, odd-even elimination and odd- even reduction. The latter usually goes under the name cyclic reduction, and can be regarded as a compact version of the former. We will show that odd-even elimination is a variation of the quadratic Newton iteration for A \ while odd-even reduction is a special case of the more general cyclic reduction technique of Hageman and Varga [HI] and is equivalent to block elimination on a permuted system. Both methods are ideally suited for use on a parallel computer, as many of the quantities involved may be computed independently of the others [H3]. Semidirect methods based on the quadratic properties are discussed in Section 7.

4 .A. Odd-Even Elimination

We first consider odd-even elimination. Pick three consecutive block equations from the system Ax = v, A = (a^.b^jCj):

V1V2 + bk-iVi + ck-ixk = Vi Vk-i + Vk + Vfen \

ak+lXk + hk+l^+l + Ck+lXk+2 = vk+l

If we multiply equation k- 1 by

-\b

-1

k- 1

equation k+1 by

■Ckbkil* and 3dd»

the

result is

38

(4.2)

("akbk-lak-l)xk-2

+ (bk ' Vk-Vk-l - Ckbk!iak+l)xk

+ (-Ckbk+lCk+l)xk+2

= (vk " - CkClVl)-

For k = 1 or N there are only two equations involved, and the necessary modifications should be obvious. The name odd-even comes from the fact that if k is even then the new equation has eliminated the odd unknowns and left the even ones. This is not the only possible row operation to achieve this effect, but others differ only by scaling and possible use of a matrix commutativity condition.

By collecting the new equations into the block pentadiagonal system

(2)

H^x - v , it is seen that the row eliminations have preserved the fact that the matrix has only three non-zero diagonals, but they now are farther apart. A similar set of operations may again be applied, combining equa- tions k-2, k and k+2 of to produce a new equation involving the unknowns

f 3)

x^_^> x^ and x^^ These new equations form the system H^x = vv .

The transformations from one system to the next may be succinctly ex- pressed in the following way. Let = A, v^ = v, m = flog,, N*1 , and define

Hi+1 = (I + CfV)Hi’

v(l+n - (I + C[H.])v(l), i - 1,2,. ..,m.

Block diagrams of the H sequence for N = 15 are shown in Figure 4.1a. The

distance between the three non-zero diagonals doubles at each stage until

the block diagonal matrix H , is obtained; the solution x is then obtained

nH-l

, - 1 (nH-1)

by x = H ,vv 7 nH-1

40

We note that the same elimination principle may be applied to matrices of the periodic tridiagonal form

It is only necessary to include row operations that "wrap around" the matrix, such as

row(N) - a b 1 row(N-l) - c b 1 row(l).

N N- 1 N I

An H sequence for N 3 16 is diagrammed in Figure 4.1b, and is constructed by the same general rule as before, though there are serious and interesting complications if N is not a power of 2.

Clearly, we can define an H sequence for any partitioned matrix A, and only have to establish conditions under which the sequence may be continued.

In fact, the sequence ||b[H^]|| is quadratic when || B[A]|| < 1.

Theorem 4.1. If || B[A] || < 1 then || B[H1+1j|| * || B[H._]2 || and || H.+1 1| ^ || H . || .

If || C[A ] 1 1 x * 1 then || C[H1+1] ||x * || C^]2^ and || * || hJj .

Proof. We have 1^ - A, Hi+1 = (I + = (21 - =

2H. - H1D[H1]_1H1 - Hi(2l - D[Ht]“ 1H£) » H^I + B[H. ]) . Suppose || B[H^] || < 1,

II C[H^]||j< These assumptions are independent of one another and will be kept separate in the proof. Since Hi * DfH^JCI - = (I - CfH^pDfH^],

we have H1+1 a D[H.](I - B[HJ2) - (I - C[H1]2)D[Hi]. Setting S - D[B[Hi]2 ] ,

41

T - D[C[Hi ]2 ] , we obtain D[Hi+1] 3 D[Hi](I - S) 3 (I - T)D[Hi] . Since

|| Sj| s || B[ HL ]2 || < 1, || T || x * || CCHi]2 || x < 1, and is invertible,

it follows that is invertible. We have B[ + ^ ] = I - D[ ] H^j3

I - (I - S)'1D[Hif1D[H.](I - B[H.]2), so B[H._]2 3 ( I - S)B[H.+1] + S;

2

similarly CfH^] C[Hi+1](I - T) 4- T. Now, S is block diagonal and the

block diagonal part of B[H.+^] is null, so |B[H^] | = |(I - S)B[Hi+^]j + |s|.

2

Lemma 2.2 now applies, and we conclude || B[H^+^]|| ^ || B[H^] ||. Lemma 2.4

implies that || C[H.+1] 1^ * || CtH.]2^. To show || H.+l|| * || H.|| if

||B(H.]ff< 1, write H. + 1 3 D[H.](I- D[ H . ]" ^DfH. ] - HJBCH^) 3 D[H±J

N

+ (H. - D[H. ])B[H. ]. We then have, for 1 s 1 s n.,

i L 1 ~j=1 J

Px(H1+l) * ^(D[Ht]) + 3i(Hi - D[H.]) || B[H.]||

£ 02(D[H.]) + C£(H. - D[H.])

3 0i(H.) .

To show || H^JIj * || Hj^ if || CC^JHj < 1, write H. + 1 3 (I - C[H.1(D[H.] - H. ) x D[H,. fSorH. ] =. D[Hi] + C[Hi](Hi - D[H^]) . We have

Y2(Hi+l) * VD[Hi]) + II C"HilH - °tHi^

5 Y£(D[H. ]) + y£(H. - D[Ht])

=* v^CH^ . QED

For a block tridiagonal matrix A we have BtH^^] 3 0 since is

block diagonal; thus the process stops at this point. The proof of Theorem 4.1 implicitly uses the convexity- like formulae

Hi+1 = Hi B[Hi] + D^Hi](I

3 C[Hl Hl+ (1 - C[Ht])D[Hi].

42

As with Theorems 3.2 and 3.3, it is not difficult to find counterexamples to show that Theorem 4.1 does not hold for arbitrary matrix norms. In particular, it is not true that || B[A ] |J2 < 1 implies || B[H2 ] j(2 £ || B[HX] |(2 or even that || B [ H2 ] 1 1 2 < 1.

The conclusions of Theorem 4.1 are that the matrix elements, considered as a whole rather than individually, do not increase in size, and that the off-diagonal blocks decrease in size relative to the diagonal blocks if j| B [A ] { j < 1. The rate of decrease is quadratic, though if || B[A]|| is very close to 1 the actual decreases may be quite small initially. The relative measure quantifies the diagonal dominance of H^, and shows that the diagonal elements will always be relatively larger than the off-diagonals. One con- clusion that is not reached is that they will be large on an absolute scale. Indeed, consider the matrix A = (-l,2+e,-l), for which the non-zero diagonals of are

(-l,2+et,-l) / l£:J(2+ek)

S’ ek+l = 4ek + ek '

For small e the off-diagonal elements will be about 2^ ^ and the main diagonal 2 i.

elements about 2 in magnitude. While not a serious decrease in size, periodic rescaling of the main diagonal to 1 would perhaps be desirable.

As already hinted in the above remarks, a more refined estimate of || B[HX+X]|| may be made when dealing with tridiagonal matrices with constant diagonals [S2]. Let A = (a,b,a) , so that

1

43

b-a2 b

-a2/b

b-2a2/b

-a

b-2a2/b

-a2 b

- a b

2 |a2/b I / j b-2a^/b I

-a

b-2a2.'b

0

b2-a2/b /

It follows that when 2|a| < jbj, || B[H?]|

'(2 - ||B[H1]2||), and thus |j| B^T || < |f B[H2 ] || < ||b[H1]2||. Note that does not have constant diagonals, nor will any of the other H matrices, but most of the rows of H, are essentially identical, and it is these rows which determine |j B [ ] f J - This property will be important for the odd-even reduction algorithms.

When A. = (a^,b ,Cj) is block, diagonally dominant, we can write Hi ' (aJi),0,...,0,bj(i),0,...,0,cja)), S<D = !|(b^i))-1||(|^i)||+ |Mji> il ) ,

,(i)

,(i)

maXj j* ' Without proving it here, we can show that if 3'‘i' <1 then

3 ( i"*" 1 ) s 3^2. An irreducibility condition plus 0^ £ 1 yields the same conclusion.

We also note that if A is symmetric then so is IL, i a 1, and thus

B[Hi]

Theorem 4.2. If A is block tridiagonal and positive definite then so is H., i i 1, and P(B[H.]) < 1.

Proof. First we observe that each is a 2-cyclic matrix [V3] since it can be permuted into block tridiagonal form. Suppose that is positive definite. Since * D[H^] is also positive definite, there exists a positive definite

matrix such that

E2t. ut so ft1+1- 2a. - a2;

44

IL and Hi+1 are positive definite iff Hi and Hi+1 are positive definite.

Since is a positive definite 2-cyclic matrix, we know from Corollary 2 of Theorem 3.6 and Corollary 1 of Theorem 4.3 of [V3] that P (B^) < 1, 5^= I - 1L Since = I - D^IL = E^B.EL , f°lf°ws that p (B^) < 1 also. Let X be an eigenvalue of H^. Then -1 < 1-X < 1 since PClL) < 1, and so we have 0 < X < 2 and 0<2X - X^ Si. But 2X - X^ is an eigenvalue of = 2H^-H^,

which is therefore positive definite. QED

Any quadratic convergence result such as Theorem 4.1 calls to mind the Newton iteration. If ‘Ji(Y,Z) = Y(2I - ZY) = (2 1 - YZ)Y, Y(0) is any matrix with || I - ZY ^||a< 1 for some norm || ||a, and Y^i+^ = 3l(Y^ ,Z) , then Y^ converges to Z 1 if it exists and (I - ZY^i+^) = (I - ZY^)^[H8].

It may be observed that = 3l(H^,D[H^] ^) , so while the H sequence is not

a Newton sequence there is a close relation between the two. Nor is it entirely unexpected that Newton's method should appear, cf. Yun [Yl] and Kung [K2] for other occurrences of Newton's method in algebraic computations. Kung and Traub [K3] give a powerful unification of such techniques.

The Newton iteration Y^+^ * 2t(Y^ ,Z) is effective because, in a certain sense, it takes aim on the matrix Z ^ and always proceeds towards it.

We have seen that odd-even elimination applied to a block tridiagonal matrix eventually yields a block diagonal matrix. The general iteration H.^ 3 S incorporates this idea by aiming the i— step towards

the block diagonal matrix DfH^]. Changing the goal of the iteration at each step might be thought to destroy quadratic convergence, but we have seen that it still holds in the sequence || B [ ] 1 1 -

4.B. Variants of Odd- Even Elimination

There are a number of variations of odd-even elimination, mostly dis- cussed originally in terms of the odd-even reduction. For constant block diagonals A = (a,b,a) with ab = ba, Hockney [H5] multiplies equation k- 1 by -a, equation k by b, equation k+1 by -a, and adds to obtain

('a2)\-2 + (b2 - 2a2)xk + ('a2)xk+2 = (bvk - avk-l ' avk+l}-

This formulation is prone to numerical instability when applied to the dis- crete form of the Poisson equation, and a stabilization has been given by Buneman (see [B20] and the discussion below).

With Hockney's use of Fourier analysis the block tridiagonal system Ax * v is transformed into a collection of tridiagonal systems of the form (- 1 , \,- 1) (sj) = (tj)> where \ ^ 2 and often X » 2. It was Hockney's obser- vation that the reduction, which generates a sequence of Toeplitz tridiagonal matrices (-1,X^,-1), could be stopped when l/X^ fell below the machine pre- cision. Then the tridiagonal system was essentially diagonal and could be solved as such without damage to the solution of the Poisson equation.

Stone [S2], in recommending a variation of cyclic reduction for the parallel solution of general tridiagonal systems, proved quadratic conver- gence of the ratios l/X^ and thus of the B matrices for the constant diagonal case. Jordan [Jl], independently of our own work [H2], also extend- ed Stone's result to general tridiagonals and the odd-even elimination algo- rithm. The results given here apply to general block systems, and are not restricted solely to the block tridiagonal case, though this will be the major application.

46

We now show that quadratic convergence results hold for other formu- lations of the odd-even elimination; we will see later how these results carry over to odd-even reduction. In order to subsume several algorithms into one, consider the sequences = A, v^ = v,

»i+i Fi<2°rsi) -

v(l+1> = F . (2D[H. 1 -

1 L 1J 11

where F^, G^ are nonsingular block diagonal matrices. The sequence H cor- responds to the choices F. = I, G. = DfH.l .

i l i

In the Hockney- Buneman algorithm for Poisson's equation with Dirichlet boundary conditions on a rectangle we have

(a,b,a) , ab = ba, a

(i)>,

a, b

b,

F = diag (bv“')» G

ba) - b, b(i+» 1

.<» - a, a<i+1) =

DCH.f1,

b<i)2 - 2a(i)2,

-a

(i) 2

Sweet [Sll] also uses these choices to develop an odd-even reduction algo- rithm. We again note that the matrices lose the constant diagonal prop- erty for i 2 2, but symmetry is preserved. Each block of H. may be repre- sented as a bivariate rational function of a and b; the scaling by en- sures that the blocks of block rows j2i of are actually polynomials in

a and b. In fact, the diagonal blocks of these rows will be b^i+^ and the non-zero off-diagonal blocks will be a^i+^.

Swarztrauber [S6] describes a Buneman-style algorithm for separable elliptic equations on a rectangle. This leads to A = (a^.b^Cj) where the

blocks are all n x n, and a. = ry.I , b. = T + 3.1 , c. = y.I . In this

J J > J J n

case the blocks commute with each other and may be expressed as polynomials

47

in the tridiagonal matrix T. Swarztrauber' s odd-even elimination essenti- ally uses = diag(LCM(b i- 1 > bj+2i-l^ » = Tn for j s 0 or

j 2 S + 1, G. = D[Hl ]” 1

Stone [S2] considers a similar algorithm for tri- diagonal matrices (n = 1) but with = diagfb^^i-l bj+2i-l^’ Gi =

Note that symmetry is not preserved.

In all three cases it may be shown by induction that H

i+1

F F.H. ,

l 1 l+l

and v(i+1) = F^...F^v^+^\ These expressions signal the instability of the basic method when F^ ^ I [B20]. However, in Section 2 .A we showed that the B matrices are invariant under premultiplication by block diagonal

matrices, so that BfH^] * B[H^].

(i)

The Buneman modifications for stability represent the vv sequence by

sequences p^ , q^ with v^ = + q^. Defining = D[H. ],

Q. = D. - H., S. = D[Q.G.Q.] and assuming D.G.Q. = Q.G.D., we have

xi i li LXi ixiJ & i ixi xi i i’

,(» -

(i+1)

0, q

(1)

.CD

v,

p(l) + D^1(Q.p(i) + q(i)) q(l+1) = F.(Q.G.q(l) +S,p(l+1)).

Theorem 4.3. Let P = I, P. = B[H._n ] . . ] , i

o(I) <VPi V*-

Then p

(i)

(I - P^x,

Remark. Although not expressed in this form, Buzbee et al. [B20] prove spe- cial cases of these formulae for the odd-even reduction applied to Poisson's equation.

Proof. To begin the induction, * 0 = (I - P^)x, q*'^ = v = H^x

3 (®1^1 " Suppose p^ = (I - E\)x, q ^ * (D^P^ - Q^x. Then

■Ar

LCM a least common multiple of the matrices considered as polynomials in T.

43

p(l+1) = (I - Px)x + - P.)x + (D.P. - Ql)x)

= (I - P. + 5T 1Qi - D^Q.P. + D^D.P. - D^1Q.)x

= (I - B[H.]P.)x

= (I - Pi+1>*

, ( i+1) _ - (i+1)

= Hi+lX - W1 - Pi+l)X

(P>i+l " ^i+1 ' °i+l + °i+lPi+l)x = <5i+lPi+l " WX‘

We note that when = DfH^] 1 and S = || B [ A ] 1 1 < 1 then ||P.J| = and Pk(D.P. - Q.) * Pk(D . ) || P . || + Pk(Q.) * ^(D.) + ok(Q.) = Pk(H.) , so || D^I\ - Q^JI £ || || . It also follows that || x - || ^ || K|| || x|| and

|| q^"1 + Q^xJI £ || || || x|| . Thus p^ will be an approximation to x;

Buzbee [B17] has taken advantage of this fact in a semidirect method for Poisson's equation.

/"'-I

4 .C. Odd- Even Reduction 4.C.I. The Basic Algorithm

The groundwork for the odd-even reduction algorithms has now been laid. Returning to the original derivation of odd-even elimination in (4.1) and (4.2), suppose we collect the even-numbered equations of (4.2) into a linear system A*'2^2'* = w^, where

A< ' ' ('S2jb2 j-ia2j-l,b2 j" a2Jb2J-lc2j-l * °2 jb2 j+l*2J+l’"C2 j^j+l^J+l’iy

*«> - <*2)V, '

49

W< } " (V2j " a2jb2j-LV2j-l " C2jb2j+lV2j+l)N2 =

N2 = LN/2J.

(2)

This is the reduction step, and it is clearly a reduction since A is half

(1) (2) (2) (2) (2)

the size of A =* A. Once A x = w has been solved for x we can

compute the remaining components of x^ * x by the back substitution X2j-1 = b2j-l(v2j-l ' a2j-lx2j-2 " C2j-lX2j)’ j = N/2'’

Since is block tridiagonal we can apply this procedure recursively

to obtain a sequence of systems A^x*'^ = , where A^ = A, w^ = v,

x(L) = (xj2i-l>Ni» - N, N.+1 = LN./2J. We write Avi; = (a^; ,c >

x^ = x . 2 l- 1 The reduction stops with A^x^ = w^m\ m = log^ \j+l”

(note the change in notation) since A^ consists of a single block. At

this point the back substitution begins, culminating in x*"^ = x.

m rn+ 1“ i

Before proceeding, we note that if N = 2 -1 then = 2 - 1. The

odd-even reduction algorithms have usually been restricted to this particular choice of N, but in the general case this is not necessary. However, for A = (a,b,a)^ there are a number of important simplifications that can be made if N = 2m - 1 [B20]. Typically these involve the special scalings men- tioned earlier, which effectively avoid the inversion of the diagonal blocks during the reduction.

It is also not necessary to continue the reduction recursively if (k) (k) (k)

A x = wv can be solved more efficiently by some other method. This

leads to various polyalgorithms and the semidirect methods of Section 7.

By the way in which A^ has been constructed, it is clear that it

(4)

consists of blocks taken from H^. For example, in Figure 4.1a, A is

(*> = /-(T) u(i) „(*)■

50

the (8,8) block of . It follows immediately that || A

* H .

B[Hi] ||, and || C[A(i) ] * || C[H(l) ] || . As a corollary

of Theorem 4 . 1 we actually have

Corollary 4.1. If ||b[A]||<1 then || B[A( L+1) ] || ^ || B[A (l) ]2 || and ||A(l+l)||i If ||C[A]||1< 1 then || C[A( I+1> ) ||j s || C(A(1) ]2 ||j

anh ||A<1+1)||lS Ha'0^.

1 1.1 j _ r ! j £ -m *■» / i _ a. A i \ _ / i I "* 1 _ II 1 1 1

and A

Recalling the definition of Theorem 3.4, let X(A) = maxj(4||bj a_. || ||b_.

Theorem 4 .4 . If X(A) £ 1 then, with X^ = X(A^), we have

t(l+1) <; (X(l)/(2 - \(l)))2.

Proof. It will suffice to consider i = 1. We have

CT2j " b2ja2jb2j-lC2j-l + b2jC2jb2j+ia2j+l

(2) ,-l

3j = -a2jb2j-la2j-l’

c(2) = -c b'1 c Cj C2j 2j+lC2j+l’

II CT2 j H ^ II b2 ja2 j II II b2 J- 1C2 j- 1 II + II b2j+ia2j+l

!Rl“a2 j)~ 1 II * (1 - V2)_1 = 2/(2 - X),

, I, (2)-l (2) || ,| (2)-l (2) ||

4Hbj aj llllbj-l Cj-lH

^ ^ lid " o>2J)“1|| ||(1 - ^2j.2)_1|| (II ^2 ja2 j

(II b2 j- Ia2 j- 1 II II b2 j-2C2 j-2 I' )

s 4(2(2 - X))2(V4)2 - ( V (2 - X))2.

b2jC2jH * 2(X/4)

b2 j- 1C2 j- 1 II }

Thus X(2) s (X(1)/(2 - X(1)))2.

51

This may be compared to earlier results for constant diagonal matrices,

(2) 2 2

which gives an equality of the form 3V = a / (2 - 3 ) . See also Theorem 4.8 of [ V3 ] .

4.C.2. Relation to Block Elimination

By regarding odd-even reduction as a compact form of odd-even elimina- tion we are able to relate it to the Newton iteration for a"*. It is also

possible to express the reduction algorithm as block elimination on a per- T

muted system (PAP ) (Px) = (Pv) (cf. [311], [E3], [W2]). In Section 5 we

show how this formulation also yields quadratic convergence theorems.

Specifically, let P be the permutation matrix which reorders the vector T 0

( 1 ,2 , 3 , . . . ,N) so that the odd multiples of 2 come first, followed oy the 1 2

odd multiples of 2 , the odd multiples of 2“, etc. For N = 7,

P(l,2, . . . ,7)T = (1,3,5,7,2,6,4)T, and

This factorization gives rise to a related method, first solving L' z = Pv,

then U' (Px) 3 z. It is easily seen that z^ =

We may now apply any of the theorems about Gaussian or block elimina- T

tion to PAP and obtain corresponding results about odd-even reduction and the above modification. In particular, Theorem 3.2 applies though its

conclusions are weaker than those of Corollary 4.1. It follows from the

T 7 (i)

PAP = LL factorization as well as Theorem 4.2 that A will be positive

definite if A is; Theorem 4.2 also allows the conclusion that p(B[A^]) < 1.

54

The odd-even reduction is also a special case of the more general

cyclic reduction defined by Hageman and Varga [HI] for the iterative solu-

T

tion of large sparse systems. Since PAP is in the 2-cyclic form

PAP

and D? block diagonal, we can premultiply by

f 1 'A12D2

V^l0'!1

to obtain

A12D2 A21

It is clear that A

(2)

°2 ' A21°l A12>

°2 " A21°l A12 and (D1 " A12°2 A21’

0) are the rows

of ^ that have not been computed as part of the reduction. In consequence

of this formulation, if A is an M-matrix then so A P(B[A(1+1)]) < p2(B[AU)]) < 1 [HI].

(i)

and

The block LU factorization PAP = LAU, where A is block diagonal and L(U) is unit lower (upper) triangular, determines the reduction and back substitution phases. Norm bounds on L and U were given in Section 3. A. 2 for general matrices with || B[A]|| < 1. For block tridiagonal matrices in the natural ordering we have || L 1 1 ^ £ 1 + || CfAjHj , || L|| £ n|| C[A]||^ if || G[A] || x < 1, ||U|| S 1 + || B[A]||, || U||x * 1 + n|| B[A]|| if ||B[A]||<1,

n = maXj n ^ . For the permuted block tridiagonal system only the bounds on || L || and 1 1 U 1 1 ^ wi 1 1 grow, and then only logarithmically. By the construc-

tion of L and U,

55

|| L || s 1 + max || C[A(l)]|| ,

15iSm 1

INI * 1 +_ II C[A(1) ] ||,

i-1

Hull I*l+L II B[A(i> ] || ,

i-1 1

II U II s 1 + max || B[A(l) ] ||. l^i^n

Since C[A<'l)] is block tridiagonal, p^CTA^]) 5 2n || C[A ^ ] || , so || C[A('I') ] || 5 2n || C[A(l) ]|| L; similarly || B[A(l) ] || 1 5 2n|| B[A(l) || . If II B[A] || < 1, || C[A ] || < 1, we then have

|| Lllj * 1 + || CCAJHj

|| L|| 5 1 + 2nm|| C[A]||lf

|| U|| x 5 1 + 2nm|| B[A] || , || U || ^1+ || N[A] || .

Much like the proof of Corollary 3.4 we also have, if || B[A]||

> x <

SO

and

lib

(i+l)

tf-

(i+l)

j

b(i)n

b2j (1 a2j )

II B[A

(i)]2

b^}|| (1 + Si> , S. = || B[A(i)]2

II (bji+1)) “1 1|

... (i+l). cond(bj )

(b^V1!!/ (i - tj.

cond(b

(i) 2 j

)(1 + S.) / (1 - 3.).

56

Since our interest is in bounding cond(b^ji+^) must simplify the above expression. Let = II that

in terms of cond(b^) , we j=1(l + 3 j ) / ( 1 - Bj), so

cond(b^1+1))

5 [max^ cond(bk)]Ki

by induction.

Using |3

we obtain

Kis"1i-i(1 + 3j)/<1- ^.i>

= (i + s.)/[(i - spn^Jd - s.)]

< (1 + 3.)/(l - S^1.

This is a rather unattractive upper bound since it may be quite large if 3^ is close to 1. Actually, the situation is not so bad, since

II II ^ Jill on(^ -i^ll _

2 j

'll<fll B[A ] || = 2^i' and

cond(b^1+1)) < [maxj^ cond(bk) ] j _ ^

2+3.

If X(A) s 1 then || <j<J} || <; X(i)/2, \(l) = X(A(l)), and

cond(b^L+1)) £ [maxk cond(bk) ] 2 + X

(i)

2 - X

(i)

^ 3 ma3C[c cond(bk>

When A is positive definite it follows from various eigenvalue- interlacing theorems that cond(b*j^) < cond(A<'^’^) < cond(A).

4.C.3. Back Substitution

Now let us consider error propagation in the back substitution. Suppose yki+^ x,<i+1) + ?k^+^\ 1 ^ k S defines the computed solution to

. ( i+1) (i+1) (i+1)

A x a w

, and we compute

57

1

Then

*J- 1

(b^ )'1(W(1) - a' ' y'~ - . c

V 2j-l; V 2j-l 2i-l yi-1

.(i)

(i+1)

2J-1 yj-l

(i) (i+1).

2J-1 yj

+ e

2 j— 1 *

y(« , y(i+1>

y2j yj

§(1)

^2 j- 1

rw

' 2j- 1

.(i)

'2j-l

(b^ )"1 a^ |(i+1) _ (b^ c(i) e(i+1^

V 2j-l; a2j-l *J-1 ^j-l* C2j-1

+ e

(i)

2 j- 1 *

-(i) , -(i+1) a2 j -j

Now define the vectors

.(i)

.(i)

zl " <ei * ° «3 » °’

.(i+1)

(i+1)

(0, 5 > o, .

We have |j z^\\ - || *(i+^|| by construction, and

?(l) = B(l)z2 + zx + z2, B(l) =■ B[A(l)].

But in consequence of the construction we actually have

If || B[A] || <1 then || B(1)z,|| £ || B

§(L) || ^ max(|| B(l)z2 + zjl , || z2 || ) . (i)

z2 II < II Z2II

Suppose || ?(m) || £ e and || e(l)|| * e for all i =* l,...,m-l.

II ?(l) II s ™*<il B(i) || || z2|| + || z1||, || z2 || )

* II Z2II+ II ZJI

(i+Dn I, e(i)

s (m - (i+1) + 1) e + e (m - i + 1) e.

Then

(by induction on i)

58

and || || , the error in the computed solution of Ax = v, is bounded by

me, m => rlog2 N + l"! . This is certainly a satisfactory growth rate for error propagation in the back substitution. In fact, it is a somewhat pessimistic upper bound, and we can more practically expect that errors in the solution of x^+^ = w^+^ will be damped out quickly owing to

the quadratic decreases in || || .

4.C.4. Storage Requirements

On the matter of fill-in and storage requirements, we note that block elimination in the natural ordering of A, as discussed in Section 3. A, re- quires no additional block storage since there is no block fill-in. However, it has already been seen that odd-even reduction does generate fill-in. The

amount of fill-in is the number of off-diagonal blocks of A ^ , ...,A^m\ m

which is 2(N. - 1) < 2N. Since the matrix A requires 3N - 2 storage

~i=2 L

blocks, this is not an excessive increase. Of course, we have not cons id red the internal structure of the blocks, as this is highly dependent on the specific system, nor have we considered temporary storage for intermediate compu tat ions .

We do note, however, that if the factorization need not be saved, then

some economies may be achieved. For example, in the transition from A^

(2) (2) (2) (2) to A , the first equation of A x = w is generated from the first

three equations of A^x^ = w^ . The first and third of these must be

saved for the back substitution, but there is no reason to save the second

(2) (2) (2) (2) (2) (2) once Av xv = wv is formed. Thus the first equation of A^ xv = wv

may overwrite the second equation of A^x^ * w^ . Similar considera- tions hold throughout the reduction, but some modifications are needed on the CDC STAR-100 due to its restrictions on vector addressing (see Section 10. B).

59

The special scalings of odd-even reduction for systems derived from separable elliptic PDE's allow implicit representation of the blocks of as polynomials or rational functions of a single matrix variable.

Since the blocks of A are quite sparse this is necessary in order to limit intermediate storage to a small multiple of the original storage. Unfor- tunately, there does not seem to be a similar procedure that can be applied to nonseparable PDE's. The only recourse is to store the blocks as dense matrices, and on current computing systems this will limit them to small sizes .

4.C.5. Special Techniques

Now suppose that A = (-p ,t + p + q , -q ) , p ,q ,t i 0, p = q = 0;

J J J J J J J J IN

we would like to apply Babuska's tricks for the LU factorization (Section

3 .A. 4) to the factorization induced by odd-even reduction. This only par-

(2)

tially succeeds. First, we can represent A where

(2) (2) (2) (2) (2)

(-p . ,t. +p. + q . ,-q . )

j j j j j

pj

(2)

.(2)

.(2)

qj

P2 j b2j-l P2 j— 1 *

C2j + P2 j b2j-l *"2 j- 1 + q2j b2j+l C2j+1’ q2j b2j+l q2 j+1

Ck + Pk + V

(2)

Note that t^^ 5 £ b2j" tbe ot^er han^» the technique of avoiding

T

the back substitution by combining the LU and UL factorizations of PAP

_T

will not work. Let A' = PAP

(I + L) (D + U) = (I + U*) (D* + L*) , (I + L)f = Pv,

JU J. ^ ^

(D + U) (Px) = f, (I + U )f 53 Pv, (D + L ) (Px) = f , so that (D + D* - D[A' ]) (Px) =* f + f* - (L* + D(A' ] + U) (Px) .

60

The only way to have A' = L + D[A' ] + U is if no fill-in occurs in

ie

either factorization, in which case L = L[A' ], U = U[A]. In particular fill-in occurs in odd-even reduction, so the method does not work in its entirety. However, it will still be advantageous to do the reductions by developing the t^ sequences. (See [SI] for some examples.)

SchrBder and Trottenberg [SI] also consider a method of "total reduc- tion" for matrices drawn from PDE's and finite difference approximations on a rectangular grid. Numerical stability and further developments are discussed in [Sla], The total reduction is similar to odd-even reduction (which is derived in [SI] by a "partial reduction"), but just different enough to prevent the application of our techniques. Part of the problem seems to be that total reduction is more closely related to the PDE than to the matrix, though the algorithm is expressible in matrix form. Various convergence results are undoubtedly obtainable, but we have not yet deter- mined the proper setting for their analysis.

4 .D. Parallel Computation

Finally, the utility of the odd-even methods for parallel or pipeline

computation has been demonstrated in several studies [H3]. Stone [ S2 ] ,

Lambiotte and Voigt [L2], and Madsen and Rodrigue [Ml] discuss odd-even

reduction for tridiagonal systems, and conclude that it is the best method

to use when N is large (>256). Jordan [Jl] considers odd-even elimination,

which is shown in [Ml] to be most useful for middle ranges of N (64 < N < 256)

on a particular pipeline computer, the CDC STAR-100. For small values of

N the sequential LU factorization is best, owing to the effect of lower order

terms engendered by the pipeline mechanism.

The basic method given here is a block analogue of the method examined in [L2].

61

The inherent parallelism of the odd-even methods is in formation of

. . c , (i+1) ( i+1) ( i+1) (i+1) .

the equations of A x = wv or lL+^x = vv and in the back

substitution for odd-even reduction. It is readily seen from (4.2) that

H0 and v^ may be computed from and v^ by

compute LU factors of b^, (1 £ k £ N) solve bk[^ ck vfc] - [ak ck vk], (1 * k £ N)

- bk ' Vk- 1

- Vk+l’ (1

s k £ N)

<2>

/V

*" Vk ' ak\-l

- Vk+i’ a

£ k £ N)

42)

- -Vk-1’ (3

s k £ N)

<2)

** "CkCk+l (1

^ k s N - 2)

(Special end conditions are handled by storing zeros strategically or by

(2) (2)

shortening vector lengths slightly. A ' and w ' are computed by only gen- erating new equations for even k; the back substitution is then rather

(2)

straightforward once x is computed.

The algorithm for odd-even reduction, N = 2m-l, all blocks n X n,

* - 1, is as follows: for i = 1 m- 1

compute LU factors of b^j^ , , (0 ^ k ^ N. ,)

Z k.+ 1 l+l

62

solve b$m) x|m) = wjm) for i = m- 1 , . . . , 1 solve

soive D2k+1 x2k+1

(w(i) - a(i) x,(i+1)

' 2k+l a2k+l *k

c

(i) (1+1)

2k+l k+1

) , (0 £ k * Ni+1)

The ith reduction and corresponding back substitution uses 2 3 2 3

n + 0(n ))N^+^ 12 n arithmetic operations, and the total operation

2 3 2 3

count for the algorithm is (12-jN - 12m)n + 0(n N + n ) . This may be com-

2 3 2 3

pared to 4-jNn + 0(n N + n ) operations for solution of Ax = v by the LU factorization .

For a crude estimate of execution tine on a pipeline computer, we take tn + a, t « a, as the execution time for operations on vectors of length N, and t, r « t « cr, as the execution time for scalar operations

[H3]. The total execution time for odd-even reduction would then be

232 3 2323

roughly (12— n + 0(n ))(tN + am) + 0(n t) versus (4— n N + 0(n N + n ))t

for the LU factorization done completely in scalar mode. A qualitative comparison shows results similar to the more precise analysis given to tridiagonal systems: for a fixed value of n, odd-even reduction will be- come increasingly more effective than the LU factorization as N increases. Odd-even elimination should maintain its utility for moderate values of N; the factor tN + am is replaced by (tN + <7)m, but data manipulation is often much simpler than in odd-even reduction. The actual crossover points be- tween algorithms are implementation dependent and will be considered to a limited extent in Section 10. C.

A critical observation on the execution time of odd-even reduction is that roughly 50^ of the total time is spent in the first reduction from *

A more detailed timing analysis is in Section 10. C and Appendix B.

J

63

A^ = A to . Special efforts to optimize the algorithm by solving the reduced systems differently must take this fact into account. Further- more, in timing each reduction it is seen that roughly 60$ of the high- order term n^N comes from the matrix multiplications used to compute A^i+^ and 30$ from solving the 2n+l linear systems for a„, , etc. Relatively

4. Kr L

little effort is used to factor the b^'s or to make back substitutions

J

having saved the factorizations.

The use of a synchronous parallel computer creates some constraints on implementation which are not so important on a sequential computer.

The first restriction in the interests of efficiency is that the blocks of A should all be the same size. This allows simultaneous matrix operations (factoring, etc.) to be programmed without resorting to numerous special cases due to varying block sizes. In addition, pivoting causes ineffi- ciencies when the pivot selections differ between the diagonal blocks.

As with the LU factorization, the assumption j| 3[A]|| < 1 only guarantees that no block pivoting is needed. If A is strictly diagonally dominant or positive definite, then no pivoting within the blocks will be required.

Thirdly, the choice of data structures to represent the matrices and vectors must conform to the machine's requirements for the parallel or vector operations. This is an important problem on the CDC Star owing to its limitations on vector addressing and allocation [LI]. When A is restricted to having small blocks this is not as serious as it would be otherwise. In the case of the Illiac IV, the problem of data communica- tion between processing elements must also be considered.

64

4 .E. Summary

We now summarize the results of this section. The odd-even elimina- tion and reduction algorithms may be safely applied to the solution of certain block tridiagonal linear systems, and in some cases various impor- tant quantities involved in the solution decrease quadratical ly to zero. The quadratic convergence may be explained by expressing the methods as Newton-like iterations to compute a block diagonal matrix. Variations of the basic methods, which have important applications to PDE's, also dis- play quadratic convergence. Parallel computation is a natural format for these methods, although practicalities place some restrictions on the type of system that can be solved efficiently.

65

5. UNIFICATION: ITERATION AND FILL-IN

Having presented several linear and quadratic methods, we now treat them in a unified manner as particular instances of a general nonstationary iteration. We are also lead to an interpretation of quadratic convergence in terms of matrix fill-in, and conclude that the odd-even methods are the only instances of the general iteration which display quadratic convergence. These results apply to arbitrary matrices satisfying |j B[A]||< 1 and not only to the block tridiagonal case.

Let us first collect the formulae for the block LU factorization, block Gauss- Jordan, and odd-even elimination.

LU: A.+ 1 = (I - L[Ai]EiD[Ai]“1)A1, Aj = A.

GJ: A.+1 = (I - (L[.\.] + UrA.])E.D[Ai]‘1)A., Aj = A.

OE: Hi+1 = (I - (L[H.] + ^H. , ^ = A.

Now consider the function F(X,Y,Z) = (2 1 - XY)Z, and observe that

Ai+1 = F(D^Ai^ + UAi]Ei, DCA.f1, A.)

Ai+1 - F(D[A.] + (L[At] + U[Ai])Ei, DfAj'1, A.)

Hi+1 = F(Hi’ D^Hi^1» Ht)

Thus it is necessary to analyze sequences based on repeated applications of F.

For fixed nonsingular Y we can define the residual matrices r (M) = I - MY, s(M) * I - YM, so F(X,Y,Z) = (I + r(X))Z and

r (F (X,Y,Z) ) - r(Z) - r(X) + r(X)r(Z), s(F(X,Y,Z)) = S (Z) - s(X) + s (X) s (Z) .

66

By taking X = Z and assuming that || r(Z1) j| < 1 or || s(Zj) Jj < 1, the sta- tionary iteration 3 FlZ^.Y.Z^) converges quadratical ly to Y This

is, of course, the Newton iteration.

The related operator G(X,Y,Z) = Z + X(I - YZ) 3 Z + Xs (Z) has pre- viously been studied as a means of solving linear systems [H8]. When Zl 3 X, Zi+^ 3 G(X,Y,Z^), Bauer [B8] observes that r(jL + ^) 3 r(X)r(Zi>, s(Z1+1) 3 s(X)s(Zi), Z. = (I - rCXiSY"1 3 Y_1(I - s(X)S, Zi+R = G^.Y.L), and in particular Z9^ = G(Z^,Y,Z^). This last formula also leads to the Newton iteration, for G(Z,Y,Z) * F(Z,Y,Z), s(Z9i> = s(Z.)2.

In this section we study the nonstationary iteration Z ^ - A,

Z.,. = F(X.,Y.,Z.), where Y. = D[Z,] ^ and X. consists of blocks taken

i+l ill i L iJ i

from Z^, including the diagonal. The selection of blocks for X^ is al- lowed to change from step to step, but we shall assume that the selection

sequence is predetermined and will not be altered adaptively. (The cost

of such an adaptive procedure, even on a highly parallel computer, would be proh ib i t ive . )

For the solution of a linear system Ax = v we take (Z^,v^) = (A,v),

(Z^+^, = (21 X^Y^) (Z^,v^) , with the sequence ending in or approach-

* 1 * ing (Z ,v ) , where Z is block diagonal. Then x = Z v . One necessary

condition for consistency is that 21 - X^Y^ = I + ri(X^) be nonsingular,

which holds if and only if p(r^(X^)> < 1. Since r^X^) and s^(X^) 3 I - Y^X^

are similar we could also assume p(s^(X^)> < 1. Using earlier notation,

ri(Xi> 3 C[X^], s^(X^) 3 B[X^]; the notation has been changed to provide

greater generality. In the context of an iteration converging to a block

diagonal matrix, the condition || B[A]|| < 1 (implying || s^X^) || ^ ||s^(Z^) || < 1)

states that A is sufficiently close to a particular block diagonal matrix,

namely D[A].

67

We first generalize earlier results on linear methods C ||s ^ (Z J j < 1 || si+i^Zi+x^ H 5 II s^(Z^) II ^ and t,len determine conditions under which we obtain a quadratic method (|| || < 1 || si+1(Zt+1> || £ || si(Zi) ||2) .

Theorem 5 . 1 In the sequence Zi+1 = F^.Y^Z^ , suppose that

{CP ,P) | 1 ^ P ^ n} C Tt C f(p ,q) | 1 ^ p,q * N} = U,

X. = L E Z.E , Y. 3 D[Z. ]_1, i. i pio i L i J J i

si(M) = I - YJ1.

Then || s^CZ^) j| < 1 implies

II H * H II * H H -

Proof. Suppose || s^Z^ |j < 1. Define Q by Zi+^ = Z^ - DfZ^Q, so Q = -Y.Z. + Y.X.Y.Z. = -s.(X.) + s.(X.)s.(Z.). Since D[s. (X.) ] = 0,

S = D[Q ] = D[s1(X1)st(Z1)]. Since s.(X.) = ^ . EpS .(Z.) Eq , || s.(X.) || * ||s.(Z.) || and || S || £ || si(Zi) ||2 < 1. Thus D[Zi+1] = DCZ±] (I - S) , Y.+1 = (I - S)"^., and s^CZ^^) = (I - S)s.+^(Zi+^) + s* Lemma 2.2 applies, so that ||s^(Z^ .) || < 1 implies || s.+1(Z1+1)|| s || st(Zi+1)||. Bue s.(Z.+1) - S.CZj) + Q -

si<Zl> * si«i> + si<Xi)si<Zi> WiC,*, + ^.Vi'Wi'V’

r = U - T^ so

+ o.(L E s. (Z.)E ) j TL p iv l q'

^ Dj(si(Zi)) .

Si(zi)

Thus || s.(Z.+1) || £ || s.(Z.) || < 1.

QED.

68

To see when the method is actually quadratic, we first establish that

2

at least some blocks of s^+^(Z^+^) are bounded in norm by || s^(Z^) ||

Corollary 5.1. With the hypotheses of Theorem 5.1,

L E s. ,(Z., . ) E I £ I s,(Z.) T. p l+l i+l q 1 1 i i

Proof. From

(5.1) (I - S) s.+1(Z. + 1) + S = s.(Z.) - s.(X.) + s.CX^s.CZ.),

S = D[si(Xt)s.(Zi)],

and E (I- S) = (I- S)E , we obtain P P

(I - S) L E s . , . (Z. .)E + S

T. p l+l 1+1 q

\EP si(zi,E, - \ep Sl<Xl>E«

l+ +.EP WdW

But .i(Z1)Eq - s.(X.) - 31(Xi)Eq,

so (I - S) Sj Ep s1+1(Zl+1)Eq Sj Ep s.tx^ sl(Z1)Eq - S. By Lemma 2.1,

Vp S1S-1(Z1+1)E,II S 'I Vp WS1(Z1)EJ S II sl<Xl)Sl<Zl)H S 1K<V

In certain cases Corollary 5.1 provides no useful information, since

we could have 2LE s. ,(Z. .)E =0. Block elimination is one example

i ^ p l+l l+l q

where this occurs. However, it is seen that if we are to have |js^+^(Z^+^) ||

2

£ || s.(Z.) || then we must have

(5.2) llEj.Ep st+1«1+1)Eq|| 5 II II2.

This is certainly true if T^ ia the null set, in which case we are generating

69

the sequence, so we shall assume that is non-empty and show that (5.2) will not hold in general.

We set one final condition on T\ for notational purposes, though it

turns out to have practical consequences: if it is true that E Z E = 0

p l q

for any starting matrix Z^, then (p,q) must be in T.. For example, we now write the Gauss- Jordan algorithm as

A. . = F(D[A.] + (L[A.] + U[A.])(Ij=1Ej) , D(A.]_1, A.).

i+1

Inclusion in of all blocks in Z^ known to be zero clearly will not affect the sequence Z. , = F(X.,Y.,Z.). With the additional condition

i+l iii

on T., it follows that if (p,q) £ T. then E Z.E can be zero only bv

i i p i q j j

accident, and not by design.

Corollary 5.2. With the hypotheses of Theorem 5.1, only the choice = 6 yields a quadratic method for arbitrary starting matrices Z^.

Proof . Suppose that T1 ^ 6. Following the proof of Corollary 5.1, we

obtain

(I - s> Vp si+i<zi+i)E,

?L,E s . (Z . ) E + IL,E s.(X.)s.(Z.)E p iv l q p i l l l q

which implies, by Lemma 2.1,

VP =i+i<zi+i)Eql

Er;Epsi(zi)Eq + s + sr;Epsi(xi,si(zi)EP'

1 1 r r

The presence of nonzero 1st order terms (2^,, EpS ^ (Z^)E^) in general dominates

the 2nd order terms (S + £_,E s . (X.)s . (Z.)E ), and we cannot conclude that

T'. p i i i l q

(5.2) will hold for arbitrary Z^ satisfying || s ^ (Z ^) | j < 1.

QED

70

Now let us consider what happens when block fill-in occurs, as it

will during the odd-even algorithms for block tridiagonals. For the off-

diagonal (p,q) block, p * q, we have E Z.E = 0 but E Z. ,E ^ 0. Since

p l q p 1+1 q

Zi+1 = 2Zi - Wi’ EpZi+lEq = EpXi('YiZi)Eq = EpXi Si<Zi)Eq’ and

II EpZi+lEqH * H EpXiH II Si(Zi)EqH- From <5a) and EpSEq = °» we have

(I - S) Ep Si+l(Zi+l)Eq = Ep Si(Xi} Si(Zi)Eq* WhiCh imPUeS II Ep Si+l(Zi+l>Eql

£ S + E s . (X.) s . (Z.)E s 11 p l l 11 q

Si(Zi)

(We cannot conclude that

II Ep Si+l(Zi+l)EqH * II Si+l(Zi+l)ll^ however->

As shown by these inequalities, fill-in produces small off-diagonal blocks of Zi+1 when || s^(Z^) || < 1, and the corresponding blocks of are quadratically small with respect to s^Z^). Thus one source of quadra- tic convergence is from block fill-in during the process Z. , = F(X.,Y.,Z.).

l+l v i i l

In summary, we now have two explanations for quadratic convergence in the odd-even methods when applied to block tridiagonal matrices. The first is that the method is a variation of the quadratic Newton iteration for A \ The second is that, in going from to H^_^, all the non-zero off- diagonal blocks of are eliminated and all the non-zero off-diagonal blocks of H. , arise from block fill-in. Each block of BTH. is there- fore either zero or quadratic in 3[H^].

71

6. HIGHER-ORDER METHODS

For Poisson's equation, where A = (-1, b, -I), the choice N = 2™ - 1 leads to considerable simplification in the Buneman odd-even reduction algorithm. The most important properties are A^ = (-1, b^\ -I) (constant block diagonals) and the expression of as a polynomial in

b. However, in some applications other choices of N are much more natural. Sweet [S10] therefore described a generalization of odd-even reduction which could handle these cases and still preserve constant block diagonals and a polynomial representation. Bank's generalized marching algorithm [B5] for Poisson's equation uses one step of (very nearly the same) gen- eralized reduction in order to control the stability of a fast marching algorithm. This technique is analagous to multiple shooting methods for ordinary differential equations.

In this section we analyze Sweet's method as yet another variation of odd-even reduction, and show that it possesses higher-order convergence properties. The algorithm actually considered here yields Sweet's method through a suitable block diagonal scaling, and hence our results carry through much as before. It must be emphasized, though, that the methods of this section are not intended as practical methods for general block tridiagonal systems. Rather, our purpose is to elaborate on the quadra- tic properties of odd-even reduction by showing that they are special cases of convergence properties for a family of methods.

6. A. p-Fold Reduction

The idea behind the generalization of odd-even reduction is to take

a block tridiagonal linear system involving the unknowns x^,x9,...,

and

72

to produce a new, smaller block tridiagonal system involving only the unknowns x ,x„ . Once this system is solved, the knowns x ....

p 2p J p’

are back- substituted into the original equations to compute the remaining unknowns. The entire process will be called p-fold reduction; odd-even reduction is the case p = 2.

Let p S 2 be a given integer, and define = 'Jj/ pj . Starting with

(2) (2) (2)

Ax = v we generate the reduced system A x = w , where

A(2) = (aj2)> bj2)> cj2))x2 ' x<2) = <xpj>N > and the diagonal block b^2)

has dimension n.. For 1 s k s L + 1, define PJ 2

bkp-p+l ckp-p+l

akp-p+2 ^kp-p+2 Ckp-p+2

akp- 1 kp-

When kp- 1 > N, either extend Ax = v with trivial equations or let be redefined as a smaller matrix using the available blocks of A. Next, re- partition A to have diagonal blocks

V V V b2p bN2p* \+i

and call the new system Ax = v. Three adjacent block rows of K are thus

(0

The p-fold reduction is obtained by performing a 2-fold reduction on A, and the back substitution is obtained by solving the system

By taking

\p-tH-l 'kp-cH-l *kp-p+i\

v\p- 1

kp-1

i- V

"kp- 1 /

kp- p+1

I 0

kp-p+l'

c v

kp- 1 kp- 1

the reduction step simp lies to

(6.1)

■T

f’

, (2)

'k

42) = \p akp Vi "kp Vfl’

akp °kp- 1

bkp akp "kp-1 Ckp \p+l

Ckp ^kp+l

and the back substitution becomes

74

*kp-j °kp-j °kp-j *kp-p " Vkp-j ^kp’

1 * j * P-1.

Rather than computing the entire a and y vectors, a more efficient technique is to use LU and UL processes on the A blocks:

for each k = 1,2,...,N2 + 1,

(LU process on A^)

^k, 1 = akp-p+l ^k.l = bkp- p+1

^,1 Vkp-pH-l’ for j = 2 .... ,p- 1

solve tAk, j- 1 Ck,j-1 Fk, j- 1 ^

'^k,j-l Ckp-p+j- 1 \,j-l^

\,j ^P-P+j Ak,j-1

"k.j bkp-fH-j akp-p+j Ck,j-1

■\,j Vkp-fH-j akp-p+j Fk,j-1

SOlve ^.p-l^p-l \p-l \p-ll

'°Tc,p-l Ckp-1 "^k.p-l"

75

for each k = 1,2,...,^,

(UL process on A^+^)

3k+l,-l " bkp+p-i; Vk+1,-1 = Ckp+p-l’

^+1,-1 Vk]>4-p-l

for j = 2, . .. ,p-l

solve ®k+l- j+l °k+l,-j+l Fk+l,-j+ll

r akp+p- j+1 vk+l j+1 ’\+l j+1 ^

B = b c A

k+ 1 , - j kp+p-j kp+p-j H<+l,-j+l

y

Ckp+p-j Ck+l,-j+l

k+l,-j

co. v c F

_k+l,-j kp+p-j kp+p-j k+l,-j+l

solve ^kp+l r\pH-l^

^•^.p+l Vk+l,-p+l C°k+ 1 , - p+ 1 -

The block tridiagonal system A^x^ = w^ is now generated by (6.1) and solved by some procedure, and we take advantage of the earlier LU process for in the back substitution:

for each k = 1,2 N2 + 1,

^kp-l ^kp-1 \p-l *kp-p " Vkp-1 \p for j = p-2 1

’'kp-jH-j Fk,j \,j *kp-p " Ck,j *kp-p+j+r

6 . B . Convergence Rates

As observed in Section 2, if j) B[A]||< 1 then, under the K partition- ing for p-fold reduction, we have || B[£]|| £ || B[A]|| . By Corollary 4.1 it follows that p-fold reduction will yield || A^ || £ || ff|| = ||a|| and II B[A( ^ ] || s || B[£]^ || s || B[A]||^. In fact, we can provide pth order

L

76

convergence, although our first result is somewhat weaker than might be expected in comparison to earlier results.

We will actually consider a natural generalization of odd-even elim- ination. Let Mp be a nonsingular block (2 p— l)-diagonal matrix,

M ( m , _,..., m m . , ) .

P J j'P+1 J,0’ j,p-l N

We want to pick M such that P

M A = (r . , 0, ..., 0, s., 0, . . . ,0, t)

p J j j

C2)

MpA is a block (2p+l) -diagonal matrix; A is constructed by selecting the intersections of the block rows p,2p,..., and block columns p,2p,...

Thus we have

(2) , (2)

a = r , b x = s j JP j jP

c<2) = t. , W<2) - (M v). . J JP J P JP

Now, the jth block row of M A is

P

~^k=- p+1

m. [block row (j+k) of A], j > k

so that

rj = mj ,-P+l

sj = mj,-icj-i + mj,obj + raj,+i Vi

ti = mj,p-i cj+P-i'

We require that

mf, , c . , , + m, , b , . + m, , , a, , , - 0,

j,k-l j+k-1 j,k j+k j ,k+l j+k+1

I k | = 1 >2 , . . . ,p- 1 ,

77

where we take m. = 0, m. =0.

J.-P J.P

The special case p = 2, = (-a^b^ ^,1,-c^b^^), has been considered

in Section 4. A. For p > 2, define

d. = b.

J.-P+l J-p+1

dj,-k

‘‘j.-k

j,k

Vj,k

- b . . - a., d . ^ , ,c., ,,

j-k j-k j ,-k- 1 j-k- 1

k = p-2,...,l aj~k d j , - k- 1

k = p-2, ... ,0

= b.

j.P-1 j+P-1

|) « p rj ra

j+k j+k j , k+ 1 j+k+ 1 k = p-2 ) t * « i If

= -c d_1

j+k j,k+r

k = p-2 0 .

Then we mav cake

j,-k ^j.O “j.-l “j.-k+l

a .

1.0

m., =v. .v. ^ ... v ,

J »k j,0 j,l j , k- 1

k = 1 ,2 , . . . , p- 1 .

The d , 's are derived by an LU factorization process, and the d 's J>“K- j >+k

by a UL process. It follows from earlier results that if || C[A]||^< 1 then [j m^ s || , so the diagonals of decrease in size as one

moves away from the main diagonal.

Our first theorem partially generalizes Theorem 4.1.

78

Theorem 6.1. Let J = (-s.^r., 0, .... 0, -s.^t.), B = (-b.^a., 0, -b.^c,),

JJ J J J J J J

B =■ L + U, J = JL + Jy, II L || S a, II U || ^ a, || B || s 2m = 8, with * s

Then || J|| * 8?, ||jl|| s , || JyH * ±3>P .

Proof. Let RQ = 0, ^ = L(I - R^'V TQ = 0, Tfc = U(I - T^)'^,

k=l,...,p-l, S=R i+T , , so that

p- 1 p- 1

(I - S)JL = L (I - Rp_2)"1L(I - Rp_3)"1L ... L(I - R0)-1L,

(I - S)JIT = U(I - T „)'1U(I - T ,)-1U ... U(I - T.)-1U.

U p-Z p- j 0

2 - 1

Define '{r = 0 > t, = a (1 - i, .) , so

0 k k- 1

Kii* V llTklls V

||JLI|, II Jyll^oP/td- 2Vl)7^ (1- V], * 0p/[2p_I(l - 2* ^Ijd- tk)].

Define 50 = °’ 51 = l’ 5k+l = 5k ' * 6k-l’ 30 6k = det(a'1>a()(k-l)x(k-l)> °*

Vk = 3 V 5k+l’ 1 ' *k = 'k+2 'k+1’ (1 " 2ip-l) “k=0(1 " V

(1 - 2 i ) 5 = 5

Define T = 5. l1 - a 5 so t = 1,

p-1 p p+1 p-1 k k+1 k-1 L

T2 ~ 1 - 2 or*" , ~k = Tk-1 - a2Tk_9. By induction, Tk(-) = 2 k+1,

d or = -2nk6k ^ £ 0. 3ut then, for m S 4, ~k(») 2 Tk(-|) , so 2P («») 2 1.

Thus || J|| s SP and || jJ| , || jJ| * j3P.

QED.

Corollary 6.1. If ||B[A]||<1, ||l[B[A]]||, || U[B[A j ] || * \ || B[A] || , then || B[A(2)]|| * || B[A] || P.

In generalization of Theorem 4.4, there is the very surprising

Theorem 6.2. With X(l) = X(A(l)), X(1) SI, y.(l^ = (l-./l- X(l) )/ (1+. 4- \(i)> ,

we have

79

\(i+D ^ X(i)PfJ_±-i£

Proof . It suffices to consider i = 1, so we xjill drop the superscripts. Define e. = b ^ d. , T] . = b . ^ d , k = 0,...,p-l, so that

J,-k j-k j ,-k j > k J+*^ J ,k

= (-a, a5

j j,-l'v j-l “j-l 3,-2

),

.-1 . , ,-l -1 .

j- P+2 a j-p+2 '’j.-p+l

X (b"1 . a. ,,

J-p+1 j-p+1

(2) = r. ,

a JP

3

t. = (-c . 71 . 1 .) (-b . 1 c

J J J,1 J+l J+l 3,2

X (b- 1 , a . ,) ,

v j+p- 1 j+p- 1

,-1 -.-1 j+P- 2 C j+p-2 j,p-l

c(2) - t. ,

J JP

h(2) - « b , ■“ s . j

j jp

v - 1 , -1 1 , - 1

s . = b . - a. e . b. , c. , - c. T. b a

J J J J.-l J-l J-l J J.l J+l J+l

= b . (I - a.) .

J J

Now

11 ^p-i11 3 1# ii 7ij!kii s (i xit ^jlk+iii /A)"1- By definins ei = i.

Ek = (1 XEk-l/4)"1> We haVe II ^Jp-kH * V Similarly, || * Efc.

This yields || c || * XE^/2, || (I - CTj)-1|| £ 2/(2 - \E ^ . To bound \(2) observe that

4||b<2>-1a<2)ll II bP)-1c(2> "

4 I) (I - a. ) JP

j-l j-l"

'ln " \1 a. II II e:1 J| || b} , a.

jp JP 1 1 JP.-l 1 JP-1 JP

x [l ejP.-2,! ••• 'jJ.-pU11 11 'Wl “jp-P+11

x 11 (1 w

-1

--1

X I 71

-1

JP-P.2 1

b” 1 c .

JP-P JP-P

JP-P»P“ 1 II II b JP- 1 C JP- 1 1

-1

jp-P.l" 11 jp-P+1 jp-P+1

s 4

( 2-A (hp nr;E2.

V2 - XE J 4 j-l j

x p- 1

80

(2)

Thus is bounded by the above quantity. Suppose first that X - 1,

which implies that y, = 1. Then E^ = 2n ' (n+1) , the bound simplifies to

XP, and we are done with this case. Now suppose that X < 1. It may be

shown by induction that E^ 3 (1 + u.)(l - u.n) (1 - y,n+^) , yielding

(2/(2 - XE ))~ = (1 - X) ^((1 - ilP) (1 + u,P))""> and with a little effort P” i

the bound simplifies to the desired result. QED

Note that we have not actually shown that X^ £ 1 implies ■£ 1.

This point is now to be resolved.

Corollary 6.2. X^ < 1, and

If

(i+1)

u

s: 1 then \(i+1) (i) p

, (i)P . ,

£ \ with strict

inequality if

Proof . The case X^ = 1 was considered in Theorem 6.2. Suppose that 0 \ < 1, again considering only the case i = 1. It is not hard to show

1 Lp/2J i

that ((1 + n),/2>P(2/(l + yP)> = (7. (X))" , where n (X) = (P.)(l '

p P ~i=0

Now, II (1) = 1, 7' (X) < 0 for 0 £ \ < 1, so n (X) > 1 for 0 ^ \ < 1 and P P P

X^“^ < XP for this case. Assuming only \ £ 1, and since \(1 + y,)-/4 = u, we have X^ S y,p(2/(l + y,P))% which implies y/^ ^ y,P . QED

6.C . Back Substitution

Error propagation in the back substitution for p-fold reduction de- pends on the particular implementation. In the simplest form

*kp-j ^kp- j \p-j *kp-p ^kp-j Xkp

errors in

*kp- j

. depend on newly generated rounding errors and propagated

errors from x, and x, . We can write the computed solution as kp-p kp

y, , = x, . + %. , to find

ykp-j Kp- j kp- j

81

'kp-j ^p-j 'kp-p 'kp-j ^kp + ekp-j"

It follows from the analysis of the Gauss- Jordan algorithm (Corollary 3.5) and from explicit formulae for \p_j> j> that || l| £ || B[A]|f J,

II \p-jH S II B£A]|P if II BfAlll< which leads to

II Skp-jH * H »[AiirJll Vp«+ II alt’ll 5kPl,+ II V-jll-

Errors in a newly computed component therefore depend most strongly on errors in its closest back-substituted component, and the "influence" decreases as the distance between components increases. Moreover, we have

II §kp-jl( s II B[A]H VpH’ II Spl^ + II V-jll*

If e is a bound on errors introduced at each stage, it follows that errors

in the final computed result will be bounded by e lo? N.

P

Slightly different results prevail in the back substitution

^p-l ^kp- 1 \p-l ^D-p ’Vp-l ^p for j = 2 , . . . , p- 1

‘kp-j Fk,p-j Ak,p-j Xkp-p Ck,p-j ’Scp-j+l

Errors in the

computed solution

' kp-j

*kp-J + "kp-j obey the rule

5? pm

skp-l 'kp-l "kp-p " Vkp-p "kp + Skp-1

for j = 2 , . . . , p- 1

"kp-j " -Ak,p- j "kp-p " Ck,p-j ^kp-j+1 + €kp-j’

so the errors in x depend on propagation from x. and x, , , rather KP J xp-p kp-j+1

than x^p.p and ’Stp’ This iroplies a linear growth as in the LU factorization

82

in conjunction with logarithmic growth as in odd-even reduction. If we again suppose that || e^^jj <: e, with || ?kp_p |(, || ckp|| * ?, it follows that

I! ?Up_ ! II * II B[£]|| ? + e,

II Skp.jH * || B[Xj || max(5, || lkp_j+1|| ) + e

which leads to || ?kp_j|| s % + (p-l)e. The end result is that errors in the final computed result will be bounded by ep log N.

83

7. SEMIDIRECT METHODS

Briefly, a semidirect method is a direct method that is stopped be- fore completion. In common usage, a direct method for the solution of Ax - v computes x exactly in a finite number of steps if exact arithmetic

is used, while an iterative method computes a sequence of approximations

( i) *

x converging to x. The semidirect methods are intermediate on this

scale, producing an approximation to x in a finite number of steps,

while an additional finite number of steps could produce x exactly. During

the computation, intermediate results are generated which approximate the

solution. If the error is small enough then the algorithm is terminated

prematurely and an approximate solution is returned.

The method of conjugate gradients is an example of a semidirect method since it can produce a good approximation long before its usual termination [R2]. The accelerated Parallel Gauss iteration ([H4] and Section 8.E) is another example, as is Malcolm and Palmer's convergent LU factorization [M2] of A = (a,b,a) , a and b positive definite with o (b *a) < -7.

Here we consider semidirect methods based on the quadratic conver- gence properties of the odd-even methods. This will generalize various results obtained by Hockney [H5], Stone [S2] and Jordan [Jl]. Buzbee [B17] discusses a similar technique based on the Buneman algorithm for Poisson's equation (cf. Theorem 4.3).

The name semi- iterative has already been appropriated for other purposes [V3], Stone [S2] initiated the name semidirect.

34

7 .A. Incomplete Elimination

Let us first consider odd-even elimination for Ax = v, which is the process H1 = A, v(1) = v, H = (I + C[H.])H. , v(l+1)= (I + C[H.])v(l'1 . In Theorem 4 . 1 we proved that if |j B[A]||< 1 then || B[Hi+^]|| ^ || B[H^]“ |j, so that the off-diagonal blocks decrease quadratical ly in magnitude rela- tive to the diagonal blocks. This motivates the approximate solution z^^ = D[1L] ^v^\ Because we halt the process before completion, we call this technique incomplete odd-even elimination.

Theorem 7.1.

x - z*-1'1 II / II X |

B[H.]||.

Proof . Since v^ = H.x = D[H^](I - B[H^])x, we have z^

x - B[H.]x.

D[H.]_1v(i)

QED

Thus the B matrix determines the relative error in the simple approxi- mation z ^ ^ . If 3 = || B[A ] || < 1, 0 < e < 1 , m = ! log N” , then

(7.1) k= 1 + max(0, min(m.

log

l°g9 e 2 V log ?

))

guarantees |j B[R^]|| £ e. For example, if

a =

f20 = 10-6, then

k = 6 and we should have N > 32 for incomplete odd-even elimination to be

2^ 32 o ^

applied. In fact, |j B[Vl^] || £ 3 =2 = 10 ”, so the results of the

incomplete elimination would be much better than required.

In the case n 1, a^ = c^ = a, b^ - b, we have || B[H^+^]|| *

II B[Hi ]2 }| / (2 - || B[H.]2|| ). Since x2/2 < x2 / (2 - x2) < x2 for 0 < x < 1,

k must be larger than (7.2) 1 + log

log2(e/2)' 2l log2(3 2)

85

in order to have || B[H, ]|| £ e. (By induction, 3 > 2(9/2} , and if

2k_ ^

2(3/2) > e then 3 > e.) A more refined lower bound is obtained

2 2

by defining t = 3, t . ~ T (2 - t ) , and computing the largest k such i rtf- in n

(7.3) xR > e .

Since B[1L ] is the matrix associated with the block Jacobi iteration for H^x = v^ , when || B [ H^. ] | j is small we can use this iteration to improve the estimate . pn fact, the choice z^ = of h ] * is the iterate

- iC

following the initial choice z ^ = 0. If z^k’^ is an approximation to

x and z(k»1-+1) = B[Hk]z(k,l) + D[Hk]’1v(k) then || x - z(k’^|| <:

|| B[H, ]^|| || x - z(k’°^ || . k and l may now be chosen to minimize computa-

2^ 1 £

tion time subject to the constraint 3 £ e. Any of the other standard

iterative methods may be used in place of the Jacobi iteration; for related results using the theory of non-negative matrices and regular splittings,

see [HI],

We have remarked earlier that if the tridiagonal matrix A = (a., b., c.)

j j J

is irreducibly diagonally dominant and 9'

B[H^]|| then 3^ £ 1 and

^ 9<‘i'1 2 . This is a case of practical importance in differential equations. However, the example A = (-1, 2, -1), for which 9^ = 1,

1 £ i ^ m-1, 3^ = 0, shows that strict inequality is necessary for incom- plete elimination to be effective.

Indeed, when 3^ is very close to 1, as will be the case for most second-order boundary value problems, the fact that we have strict in-

equality does not imply that 3V will be small for i < m. With regard to the particular approximation z^ = D[H^] v^\ z ^ ^ depends only on v £, j - (2i_1 - 1) £ l s j + (2 1-1 + 1) . Thus for N = 2m - 1, i < m,

86

( i.)

the center point 2^m ^ will be independent of the boundary data. Al- though this is not the only possible approximation, complete odd-even elimination will be necessary to ensure convergence to the continuous solution .

7 ,B . Incomplete Reduction

The incomplete odd-even reduction algorithm is analagous to incomplete odd-even elimination. Complete reduction generates a sequence of block

tridiagonal systems A^x^ = w , i = l,...,m, m = ~log? N+ 1~ , solves

, (m) (m) (m) , , (m) , , , . ,

Ax - w exactly for x , and back substitutes to finally obtain

(1) . , . , (k) (k) (k) ,

x = x . Incomplete reduction stops with Ax = w for some k be-

Ck) (k)

tween 1 and m, solves this system approximately for y *3 x , and back substitutes y^ to obtain y = y^ rs x^ .

We remind the reader of observations made in Section 4.D, especially that roughly half the actual work is performed in the first reduction, one quarter in the second, and so forth. Thus the savings in incomplete reduc- tion would be much less than those in incomplete elimination, where the work per step is roughly constant. For more details, see Section 10. C.

In contrast to incomplete elimination, the incomplete reduction must deal with error propagation in the back substitution. We have already shown that roundoff errors in complete reduction are well-behaved, and can only grow at most logarithmically. But since the approximation errors quite possibly might be larger than the roundoff errors, a different analy- sis is needed. We will see that a number of interesting properties result, including no growth of errors under certain assumptions about roundoff,

*

This observation is due to an anonymous referee of [H2].

87

and a superconvergence- like effect due to damping of errors.

An initial approximation y^ may be computed from and w^ ac-

cording to our analysis of incomplete elimination. The choice

= D[A^^ ] lwOO is a simple one, and we have -y^'* [j ^ |(B[A^^]|

The parameter k should be determined both for computational economy and for a suitable approximation. Here the quadratic decreases of B[A^ ] are

important and useful in some (but certainly not all) situations.

(k)

Regardless of the source of the approximation yv , let us suppose

y

(id-1)

(k) (k) (k) (i)

that y^ = x) + In the back substitution, y is computed from

by

y<» = y<i+1)

y2j Yj

(i) = (b(D )-l(w(i) . a(D d+1) . c(i) yCi+D) + e(i) ,

y2j-l C 2j-l; (W2j-l a2j-l yj-l C2j-1 yj ; + 2J-1*

with representing rounding errors. We then have

A(i) = (i+1)

2j

6(i) = e (i)

2j-l 2 j- 1

/h(i) x"l,,(i)

( 2j-l} a2 j- 1 >1

+c<« 5<i+1))

2j-l J

Theorem 7.2. If j| s(l) |j S (1 - || B[A(l) ] || )

'<i+1> II then |H(i) || - ||6<1+1)|t

88

As in the analysis of roundoff error propagation for complete odd-even reduction, we have

i(i-) II /II

|| 5^|l ^ax(|| 6^' ||,

s(l)||+ || B[A(l)]|| || 5(l+1) || ).

From this expression and the hypothesis we have || 6^ || £ || |j .

But, also by construction, j| 5^ || ^ || || .

QED

This theorem requires some explanation. First suppose that no round-

ing errors are committed, so that = 0 and all the error in y^ is

.00

derived from the approximation error in y . It now follows from |j B[ A ] || < 1 that this error will not grow during the back substitution.

From | x

00

s || X || , we conclude that ||x-y || / ||x || ^ -y ^ || ' ||

In fact, it is easy to see that while 1 1 x y|| = || x^ - y <'k'> ||, many

vector components of x-y will be much smaller in size than J| x - y j j .

In Figure 7 . 1 we illustrate the errors by component for the tridiagonal

system (-1, 4, -l)x = v, >1 = 31, k = m-1 = 4, v chosen so that Xj = 1,

and y^ = b^k^ “w^k^ . In the absence of rounding errors, the back sub-

stitution quickly damps out approximation errors, creating a very pleasing superconvergence- like effect.

Now suppose that rounding errors are committed and we have the upper

bound

.(i)

£ e. If

;(k)

^ ve for some moderate constant v then we

have essentially solved the system A^x^ = w ^ exactly, and the error

propagation analysis is the same as for complete reduction. That is, we

(i)

have || || ^ (k - i + v)e, 1 ^ i s k.

;(k)

The remaining case is when || || » e. Here the semidirect method

must be considered in competition with an iterative method. The hypoth- esis of Theorem 7.2,

6^ || (1 - || B[A^ ] || ) 5 e, should be viewed as

90

an attempt to formalize the condition j| 5^ || » e . As long as the hypoth- esis remains true, the difference (in norm) between x^ and y^ can en- tirely be attributed to the original approximation error. As soon as the hypothesis is violated it is possible for the errors to grow, but we will still be able to say that || x - y|| = || 6^ || +• 0(ke) . Although the super- convergence effect would be moderated somewhat by roundoff, its qualitative effects will still be felt and we can expect this bound on accumulated er- ror to be an overestimate.

7 . C. Applicability of the Methods

We close with some remarks on the practical application of the qua- dratic semidirect methods. In Buzbee's analysis [B17] of the Buneman algo- rithm for the differential equation (-A + d)u = f on a rectangle with sides c ^ and c , da nonnegative constant, it is shown that if

1 2 T2 ^ 2

+

_2

] » 1

then a truncated Buneman algorithm will be useful. Geometrically this re- quires that the rectangle be long and thin if d is small, and hence the block tridiagonal matrix will have small blocks when properly ordered.

By directly considering the finite difference approximation to Poisson's equation on a rectangle with equal mesh spacing in both directions it is possible to reach a similar conclusion; this also serves as a simple model for more general elliptic equations. With M nodes in one direction and N in the other, M < N, the matrix A is (-1^, P^, - 1^) ^ , P = (-1, 4, -1) .

It is readily verified that || B[A]j|= 2 | j QM 1 1 , = P^1,

AD-A046 625

UNCLASSIFIED

CARNE6 I E-MELLON UNIV PITTSBURGH PA DEPT OF COMPUTER —ETC F/6 12/1 OIRECT AND ITERATIVE METHODS FOR BLOCK TRIDIAGONAL LINEAR SYSTE— ETCCU) APR 77 D E HELLER N00014-76-C-0370

91

\k

l D + D 1

_ rn m- i.v «

-(1 ), n = 2m,

2D

j(l - n = 2nrt-l,

a

»0' l> Dl'4> Dn'4 Dn-1 “n-2 *

The following table gives j| B[A]|| for M = 1(1)6, along with estimates

on the number of odd-even reductions needed to obtain || B[A^ ]|| < 10

k such that ]| B[A^^]||< 10 ^

M

1

2

3

4

5

6

B[A ] j

2 = -5

! * -67 | = .86

If*- If* - If*-

lower bounds

(7.2) (7.3)

upper bound

(7.1)

8

9

10

11

These figures again lead to the conclusion that the best use of the semi- direct methods for elliptic equations will be with long thin regions. In the next section we consider some iterative methods which use matrix split tings and orderings to take advantage of this property.

92

8. ITERATIVE METHODS

In this section we consider the iterative solution of a finite dif- ference approximation to a general nonseparable elliptic equation. Of particular interest will be methods which employ, as part of each itera- tive step, the direct or semi-direct solution of a block tridiagonal linear system. Continuing our emphasis on methods for a parallel computer, we want this system to be such that the method already considered can be used efficiently.

8 .A. Use of Elimination

Let us first note that several of our theorems about block elimination have applications to iterative methods. Consider the system Ax = v and the block Jacobi iteration * B[A]x^ + D[A] ^v. Suppose that

|j B [ A ] 1 1 < 1, so the iteration converges; this condition is satisfied for many elliptic equations with suitable partitioning [V3], If the structure of A can be simplified by an elimination step, forming A'x = v1 , then subsequent computations will be handled more efficiently, and since jj B[A' ]j| s jj B[A]|| the iteration will not converge any slower.

Indeed, convergence of the block Jacobi iteration will often be quite slow. When A is also positive definite the Jacobi semi- iterative method (J - SI or Chebychev acceleration) [V3]

x(l+1) = vi+1(B[Alx('l) 4- D[A]_1v - x(l_1))+ x(l"lV

vx - 1, v2 - 2/(2 - o ) ,

V.+1 = (1 - P2v1/4)“1, o = o(B[A]),

will converge at a much faster rate and is a natural candidate for parallel

93

computation. If we use 0 = || B[A]|| as an upper bound on p and use the sequence of parameters

* * , ? v1 = 1, v2 = 2/(2 - 0Z),

Vi = (1

2 * , - 1 3 v./4) \

then the J-SI iteration for A' will not be slower than that for A. By "preprocessing" Ax = v we can expect that the actual computation time will be smaller; unfortunately a complete proof of this claim is not at hand.

Juncosa and Mullikin [J2] discuss the effect of elimination on itera- tive methods when A = I - M, M 2 0, m = 0, and m SI with strict

s

inequality for at least one value of r. Their particular interest for el- liptic boundary value problems is to order the boundary nodes followed by the interior nodes, and then to eliminate the boundary nodes. This would be useful for domains with curved or irregular boundaries. It is shown that, with A' = I - M' representing the system after eliminating one point, p(M') £ p (M) . Thus the point Jacobi iteration will not be slower. Also, if G = (I - L[M]) ^ U[M] is the matrix describing the Gauss- Seidel itera- tion, then o(G') ^ p(G). Similar results follow from work on G-matrices [B4]. As in the proof of Corollary 3.2 we obtain corresponding results for block elimination, though not under the more general assumption

II B[A] || < 1.

8 . B. Splittings

The block Jacobi iteration is created by "splitting off" the block diagonal portion of A, and using the iteration D[A]x^+^ = (D[A]-A)x^ + v. Many other iterative methods are also defined by a splitting A = - M2 ,

MjX*'i+^ = + v. When is a block tridiagonal matrix, various direct

and semidirect methods can be used to solve the system for x^

94

A particularly effective procedure discussed by Concus and Golub [C4] is readily adapted to parallel computation. The equation

s -7 (a(x,y)vu) = f

on a rectangle R, a(x,y) > £) and sufficiently smooth , ^undergoes a change

~2 ~~2

of variable w(x,y) = a(x,y) u(x,y) and scaling by a to yield the equiv- alent equation

- Aw 4- p (x , y) w = q

.11 1

~ 2 2 ~ 2

on R, p(x,y) - a A(a ), q = a f. The shifted iteration (-A + k)w ,

n-r I

= (k - p)wn + q, k a constant, is solved in the discrete form (-A^ + KI)w^n+^ = (KI - P)v(n) + Q, where P is a diagonal matrix. Since the system (-A^ + KI) b = c may be solved by the fast Poisson methods and Chebychev acceleration is applicable, the overall method is highly paral- lel and rapidly convergent.

8.C. Multi- line Orderings

There are a number of more classical iterative schemes which involve solution of block tridiagonal systems. We begin with the multi-line itera- tions for a rectangular grid; the ordering of unknowns and partitioning for 2- line iteration on a 6x6 grid is shown in Figure 8.1. A diagram of the resulting matrix is shown in Figure 8.2.

The underlying technique of the multi- line methods is to divide a

square or long and broad rectangle into a set of long and thin rectangles. A discrete form of the elliptic equation is then solved over each subrec-

tangle as part of the iteration. By partitioning the grid into groups of a few lines we obtain systems that can be solved well by general block

■■■■

97

methods, particularly the semidirect methods of Section 7.

The multi- line block Jacobi iteration with Chebychev acceleration

is probably the simplest method to implement on a parallel or pipeline

computer and should be considered as a benchmark for other methods. With

an nxn grid and a k-line iteration, n = kn' , the splitting A = is

such that consists of n' decoupled block tridiagonal matrices, each of

which has kxk blocks with N =* n. By storing zeros in strategic locations

may be regarded for computational purposes as one block tridiagonal

2 ,

matrix with kxk blocks and N = nn' = n ,/k. Lambiotte [LI] points out that only log n odd-even reductions are needed because of this construc- tion, rather than log N reductions as for the general case. When incom- plete odd-even reduction is used, the number of lines may be chosen ac- cording to the guidelines given in Section 7.C for Poisson's equation.

8 .D. Spiral Orderings

Another useful ordering of unknowns for parallel computation is the

single spiral order diagrammed in Figures 8.3 and 8.4. The spiral order

essentially produces a very long and thin rectangular region, which is

represented by the tridiagonal matrix forming the central portion of the

matrix in Figure 8.4. In the splitting A =■ - M0 we take to be the

2 2

tridiagonal center of A. For an nxn grid is n xn and has || B[M^]|| about thus incomplete odd-even reduction can be used quite efficiently to solve systems involving .

We note that the single spiral order has been used with another split- ting to yield the block peripheral method [B9], [E2]. Here the blocks are of the periodic tridiagonal form, and the convergence rate of the periph- eral block SOR method is shown experimentally to be similar to the 2-line

block SOR method [BIO],

98

The double spiral ordering (Figure 8.5, 8.6) is an attempt to com- bine features of the single spiral order with advantages of a 2-cyclic

1 2

ordering. For an nXn grid, n odd, one spiral contains -(n + 2n - 1)

1 2

points and the other contains (n - 2n + 1) points. Multi- spiral or -peripheral orders can also be defined.

For the single and double spiral orders, when n is even the trailing submatrix of A looks like

The *'d elements of A should be included in the implicit part of the itera- tion, M^ . This can be done at little extra cost, and ensures that each row and column of has at most two non-zero elements. When n is odd a similar modification should be made.

The mapping from the natural ordering to the peripheral or single

2

spiral ordering is defined by a permutation of the vector T = (l,2,...,n )

2

into a new vector S = (S(l) ,S(2) , . . . ,S(n )). With an nxn grid, point (i,j) is unknown number (i-l)n + j in the natural ordering. Given T, two vec- tors forming the coordinates (i,j) of each unknown are computed by taking quotient and remainder:

q(k) = L(k-l)/nJ, (1 £ k s n2) i(k) = 1 + q(k), (1 £ k £ n2) j(k) = k - nq(k) , (1 £ k £ n2) .

For simplicity let n = 2m- 1. Define

P ( i » j)

min(i,j, 2ra-i, 2m- j) , (1 £ i, j £ n) .

The point (i,j) is part of peripheral number p(i,j), 1 ^ p(i,j) ^ m.

100

Peripheral p has 8(m-p) points in it, except for peripheral m which con- tains the single point (m,m).

The peripheral ordering is per. 1, per. 2, . .., per. m, where each peripheral is taken in clockwise order: top, rhs , bottom, lhs. Define

P-1

f(i,j) = 8 (m-k) = 4(2m-p) (p-1) ,

‘ls-1

the number of elements contained in peripherals 1,2 , . . . ,p(i, j) - 1, and

g(i.j) = f(i,j) +

(i-p) + (j-p) +1 if i ^ j,

8 (m-p) - (i-p) - (j-p) +1 if i > j

= 4p(2m-p) + 1

-8m + (if i £ j) - (if i > j)

(2p + i + j)

The vector S (k) = g(i,j) defines the necessary permutation, and can be easily evaluated on a parallel or pipeline computer. The actual reordering of a vector may be performed on the CDC STAR by using the built-in vector permute instructions, though these should be used sparingly since they are quite expensive. Once the vector S is generated, the elements of A and v may be generated directly.

Of course, the implementation of an iterative method based on an "unnatural" ordering involves much more developmental attention than is pre- sented here. We only mean to show the potential for parallel computation.

A more complete comparison based on actual machine capabilities has not been performed. For such a discussion of some other methods, see Lambiotte [LI].

101

8 .E. Parallel Gauss

We close our section on iterative methods for block tridiagonal sys- tems with some brief remarks on the Parallel Gauss iteration due to Traub [T2] with extensions by Heller, Stevenson and Traub [H4]. For simplicity we consider the normalized form A = (a^, I, c^) .

In the block LU factorization A = ( , I, 0) (0, d , cj) > let D = diag(d^, . . . ,d ) . Since D = I - L[A]D Hl[A], a natural parallel itera- tion is = I - L[A](D^ ^ ) Hj[A], where is some initial estimate

of D. This forms the first of three iterations making up the Parallel Gauss method; the other two parts are Jacobi iterations for the L and U block bidiagonal systems. It is shown in [H4] that, with \(A) ^ 1, u(A) - (1 - vTrX)/(l + , we have || D - D(0 || £ u|| D - D(l_1) || .

Techniques for reducing the constant p, are given in [H4]. As Stone [S2] points out, this linear convergence is much less attractive than the qua- dratic convergence of odd-even reduction. However, in certain situations (parabolic equations or the multi- line iterations discussed here) it is necessary to solve a sequence of related problems, and this would hope- fully provide suitable initial estimates.

Let A' = (a\, cj^ an<* suPPose that D' has already been computed.

Since d'. - d. = a, d * c. . - a'.(d'. .) ^c' , , by Theorem 3.4 J J j j-1 j-1 J J-l j“l

d'. - d . II ^ (h (• -j—

j J" v4 V1 + ,/1-V

1

2r-r) + (r~) (

l + .a-v

= 1 - jCvO-X + ,1-V).

Thus || D ' - D || = max j || dj - d^ || 5 1 - -j(> fl- X + , ^1- X' ) , and we may take D' as an initial approximation to D. Now, the Parallel Gauss iteration is

102

most useful when X and X' are not too close to 1, and in this case good initial approximations will be available by the above bound on J | D ' - D | j . When X and X' are close to 1 the bound will be large, but in this case the iteration should probably not be used due to slow convergence.

For tridiagonal matrices we have

so that

Id1. - d 1 j ;

di

dj - dj ' Vl,dH 'j-!

- (di-irl(ij c>i ai 'j-P-

1 - ./l-X

- d.

i + 777 1 j-i j-i

+

1 + - 1- X'

a c

. 1 « .v. . 1

J J-l J J-l1

Let A = max . I a .c . - a’.c'. I < (X + X1 )/4 , 6 . = Id' - d | ,

J J J-l J J-l1 J 1 j j

a = (1 - ..T7x)/(1 + ./l-X') , b = 2/(1 + - '1- X' ) . Then *> = 0, 5. s a 5. , + b,

1 J J-l

so by induction 5^ £ (1 + a + . . . + a^-2') b < b'' (1-a) , or 2tJ /l-X + yl-X’ )

By the intermediate value theorem (yl- X + . ;1- X’ )/2 = .4-X’ for some \ be-

tween \ and V , so that 5^ ^ A/-/1-X . Again we have the situation where,

-'<■

if X is close to 1, a large upper bound results, but here the iteration

should not be used. When X is not so close to 1 the bound may give more

complete information than the one derived for block tridiagonal systems.

Let us now consider the utility of the Parallel Gauss iteration for

the parabolic equation ut = (au^)^, a = a(t,x) > 0, u = u(t,x), with

0 s x SI, 0 St, and u(0,x) , u(t,0) , u(t, 1) specified. Let u1 c* u(ik, jh) ,

2

k = At, h = Ax, p = k/h . The Crank-Nicolson method for advancing to the next time step must solve the tridiagonal system

/T , P ni+l\ i+1 /T P „i. i (I + - P ) u = (I - I p )u ,

P1

(- a

j- 1/2* °j-l/2 + aj+l/2 * * aj+l/2

/o)

103

We therefore have

a / i+1 , , i+1 i+1 \ /o i+1 i~\

A Daj-l/2’ 1 + P(aj-l/2 + aj+l/2) " Paj+l/2' 2);

let us now drop the superscript i+1 for convenience.

A will always be strictly diagonally dominant because a(t,x) > 0,

and we would like to know when we also have \(A) ^ 1. Using the mean value

theorem, define x. and x. by J J

aj-i/2 + aj+i/2 - "aj-l/2 " °y bj = ax((i+l)k, Xj) ,

a 3-2/2 + aj- 1/ 2 = 2a j- 1/2 h V

f? . = a ( (i+1) k, x .) .

J x J

After some algebraic manipulation it is seen that

\(A) = max.(l + (l + -rhb,)) ^

J paj- 1/ 2 2 J

x (1 + (1 - ■§ h S.))”1.

paj-l/2 2 J

In order to guarantee \(A) s 1 it is sufficient to assume th. t h and k satis- fy ~ h|ax(t,x)| < 1. If m^ = max|ax(t,x)| this becomes the condition m^k £ 2h.

_2

Suppose we want the more stringent condition \(A) £ (1 + c) , c £ 0. This may be satisfied if we choose h and k such that, for each j,

2a . / + h b . ,

CPaj-l/2 S 1 + Dhbj 2> cpa j 2 £ 1 - phH^ ' 2 .

With ra^ = max a(t,x), a sufficient condition is cpm^ s 1 Phm^ 2, or

104

(8.1) k £ 2h2/(2cm0 + hn^) .

For c = 0, (8.1) reduces to the previous condition on h and k, while for c = h we have k £ 2h/ + m^) = 0(h) .

Now, one of the best features of the Crank-Nicolson method is that k may be chosen to be 0(h) . We have seen that a small value of c still allows this choice. Conversely, if k = rh satisfies (8.1) and rm^ < 2, then c £ h(l - rm^/2)/ (rm^) - 0(h). When we try to pick h and k so that the Parallel Gauss method will be effective (c » h) we are necessarily restricted to small time steps. When h and k are chosen in the usual way (k = 0(h)), Parallel Gauss cannot be expected to be very effective since \(A) will be close to 1. Thus the method appears to be less attractive for this particular application than originally believed, and odd-even reduction would be preferred. Once several reductions have been performed, \ and u. may be sufficiently small to warrant use of Parallel Gauss and its

variations .

9. APPLICATIONS

In this section we discuss two applications of practical importance in which the assumption || B[A]I| < 1 is not met; in the second case it is shown how to get around this problem.

9 .A. Curve Fitting

Cubic spline interpolation of the data (x^,f^), 1 £ i £ N + 1, gives rise to the tridiagonal system [C6, p. 238]

h.s. + 2 (h . + h . . ) s . + h . .s,,. l i-l v l L-l 1 1-1 1+1

= 3(f[xi_1,xi]hi + f[xi>xi+1]hi_1) .

2 £ i £ N,

h. = x l i+l

x. .

i

We have A = (h^, 2(h^ + h^ ^) , h^ and |( B[A]|| = , which is quite satis- factory. However, more general problems need not behave so nicely even though they might give rise to block tridiagonal matrices.

Cox [C7] presents an algorithm for least-squares curve fitting with piecewise polynomials which must solve the system

The v vectors contain coefficients of the Chebychev polynomial expansion

106

of the piecewise polynomial, and the X vectors contain Lagrange multipli- ers derived through a constrained linear least-squares problem. The C/s always have full rank, and if there are sufficiently many data points between adjacent pairs of knots then the H/s will all be positive defi- nite. In this case a s traightf orvard band elimination may be used well when the number of knots and data points is large. When insufficient data is available the H s will not have full rank and pivoting is needed in the band elimination. It seems that the condition j| B[A]||< 1 will not hold for any reasonable partitioning or data. Indeed, B[A] is not even . defined for the most obvious partition.

9 .B . Finite Elements

In Section 2 .A we noted that matrices constructed from finite element approximations to differential equations often do not satisfy assumptions such as jj B[A]|| < 1. Thus some other technique must be used to establish numerical stability in a direct method or convergence in an iterative method. Usually it is shown that A is positive definite. The following example, admittedly a simple one, shows how an easily implemented change of variables can yield a new system A'x' = v' such that || B[A' ]|| < 1.

Even if the change of variables is not actually made, this will show that II B[A] It*, < 1 for some norm || ||^, and also serves to establish numerical stability.

Let us begin with the equation -u" = 1 on the unit interval. Using Hermite cubics, Strang and Fix [S4, p. 59] derive the 4th order accurate

finite element equations

107

, u . , , - 2u . + u . , . u' - u'

-4(-l±i r1 ti) + h - J'h - 1,

(9.1)

, u , , - u . ,

-I/_ J±i 111

5 ' 2h

2h

> + I"j - Hi ' 2ui + "i-i*

= 0

Taking the natural choice x^ = (u ^ , u^) , we obtain the block equation

aXj_1 + bxj + cXj+1 = (1,0)

The matrix A = (a, b, c) is positive definite, but || B[A]|| = + 3 ' (4h) . The problem is simply that the unknowns are out of scale.

A second set of unknowns, which are in scale, is

T h h T T

x. = (u. - TT u'. , u. +- u'.) aa (u. u.ll/o) . Since the Hermite cubi

j j 2 j* j 2 j J-12 j+1/2

can be derived from B-splines by coalescing adjacent nodes, these unknowns are actually a more natural choice than the original. We now have

res

'1 -r

,1 1,

'1 0 v0 h/

u . J

and, multiplying the first equation of (9.1) by h and the second by h,

2 T

aXj ^ + bXj + cxj+^ = (h 1, h 0) ,

108

The blocks are independent of h and we have |! B[A]|j = 1, with o^(B[A]) < 1 for the first and last block tows. (Actually, this assumes Dirichlet boundary conditions ; Neumann conditions give p^(B[A]) = 1.)

For the more general equation -u" + Qu = f, Q a constant, the follow- ing equations are derived [S4, p. 59]:

-L (.36 u.^ - 3h u'_L + 72 u . - 36 uj+1 + 3h uj+1)

+ §0(54 Uj-1 + 13 h uj.! + 312 u. + 54 u.+1 - 13h u^) = F j *

io(3 uj-i -huj-i + 8h uj 3 Vi -by

+ io(-13 uj-l - 3h uj-l + 8hu^ 13 uj+1 - 3 h u'+1)

= F'..

J

Again taking = (u^ ^ u j > Uj + ^ uj)T anc* suitably scaling the equa- tions, they become

ax . . + bx . + cx . , = (hF.,F'.) J-l J J+l J j

2

where now the blocks are 0(h) perturbations of the blocks in (9.2). Omit-

5 2 U

ting details, for h sufficiently small this gives || B[A]|| = 1 - Qh + 0(h ).

109

Since rescaling and change of variables in the linear system is equiv- alent to a change of basis in the original function subspace, the process should be applicable to more general differential equations solved by finite element methods. The underlying technique is to prove something for spline approximations, and show that the property carries over in the collapse of nodal points yielding the Hermite cubic approximations.

Strictly in terms of modifying linear systems by a block diagonal scaling, it is natural to ask if either of the following problems are solvable in general:

1. Given an arbitrary block tridiagonal matrix A, find a block diagonal matrix E such that || B[AE]|| < 1.

2. In 1., if || B[A] || < 1, find E such that || B [ AE ] 1 1 < || B[A]||.

These problems are closely related to the optimal diagonal scaling of matrices to minimize condition numbers, and as such are outside the scope of this thesis.

110

10. IMPLEMENTATION OF ALGORITHMS

This section contains some general remarks on the choice of algorithm for particular applications, and a more detailed comparison of algorithms for use on a vector computer. In the comparison we assume uniform nxn block sizes, later specialized to n = 2. All algorithms will use the same data layout, assumed to be core-contained, and the execution time compari- son will be in two parts: first for a very simplified and idealized model in which the machine vectors are defined as storage locations in arithmetic progression, and second for a model in which the machine vectors are defined as contiguous storage locations (arithmetic progression with increment = 1) . The first model uses information for the CDC STAR- 100 ignoring, among other things, the fact that machine vectors on STAR must satisfy the conditions for the second model. This double comparison allows us to pinooint the effect of data manipulation required by the stringent requirement on vec- tor operands. For a description of STAR facilities necessary for this dis- cussion, see Section 2.B and Appendix A. Implementation of algorithms for tridiagonal linear systems, n = 1, is adequately discussed elsewhere ([H4], [Jl], [L2], [Ml], [S2]) and these references may be considered as back- ground material for this section.

10. A. Choice of Algorithm (Generalities)

The salient point of our analysis has been that a number of algorithms may be applied to block tridiagonal linear systems that are likely to arise in practice. Sometimes it may be possible to take advantage of special features in order to reduce computing costs or estimate error. When the computing costs are high (such as one enormous problem or a large problem solved repeatedly) an extra effort must be made to choose the most

Ill

efficient algorithm. This choice will be difficult to make in general, but we feel that it is not at all difficult to choose good algorithms in many cases.

Regardless of the computer involved, algorithms for three special problems should be implemented before proceeding to algorithms for more general problems. The first should deal with general block tridiagonal matrices restricted to blocks of a small uniform size, say no more than 8x8. On a standard sequential computer, band or profile methods are ap- propriate; on a k-parallel computer some form of block LU factorization should be used, and on a vector computer a combination of odd-even reduc- tion and LU factorization should be used. Storage allocation and access, vector operations on a parallel or vector computer and implementation issues in general can be greatly simplified under the block size restric- tion. Options include implementation as a set of subroutines for specific block sizes, insertion of tests for use as a semidirect method, and spe- cial scalings or pivoting strategies for efficiency or stability. This restricted class of algorithms may be applied directly to small-block problems or used as a module in the iterative solution of large sparse problems as indicated in Section 8. The experience gained may or may not be useful for implementation of methods for moderate-sized blocks, which, in our opinion, should be treated in a different manner.

The best choice of a parallel algorithm for moderate- sized blocks is not clear at this time. These cases represent a cross-over between effi- cient vector implementations of band/profile methods and the odd-even methods. Either one is likely to be satisfactory. In Lambiotte's opinion [LI] the band methods are better for the CDC STAR, but this choice is

heavily influenced by vector addressing restrictions on the STAR (cf. our

112

remarks in Section 10. B) . Other parallel or vector computers may not present such problems, and a more complete analysis is needed.

The third basic class of algorithms are the fast Poisson solvers, including generalizations. Recent studies have shown that many finite dif- ference approximations to elliptic equations can be computed very effi- ciently by iterative or modification methods based on the Poisson solvers. There are several kinds of algorithms to choose from, and all have consid- erable inherent parallelism. Other algorithms are available for elliptic equations, but flexibility and ease of implementation suggest that Poisson- based algorithms should be among the first to be considered. Brandt's multi-level adaptive methods [B131 must also be seriously considered.

Finally, we note that matrix partitioning and block elimination is a useful technique for dealing with matrices so large that secondary storage must be used [R4]. Known as the hypermatrix scheme in other contexts, it has been extensively exploited by the A SKA system, a structural analysis package developed at the University of Stuttgart [F3]. Noor and Voigt [Ml] discuss implementation of hypermatrix schemes on the CX STAR- 100. The method works as follows: suppose a partition tt is given (see Section 2. A), and let A^ be the matrix A partitioned accordingly. Block sizes in ASKA are chosen to optimize transfers between storage levels, while the choice on STAR must also try to maximize vector lengths and hence minimize start- ud costs. The usual situation is that a few blocks of A will be in main

TT

memory, while the bulk of A^ will be in secondary memory. For example, if tt' is another partition such that rr refines rr' , we would perform block elimination on A^, x^, = v , . When A^ is block, tridiagonal so is A^, .

Since the data requirements for the LU or Cholesky factorization are then highly localized, this is a natural procedure [E3]. Moreover, the diagonal

113

blocks of A^, will themselves be block tridiagonal, though possibly of low order. As noted in Sections 2 .A and 3. A. 2, if || BfA^] || < 1 then || B[A_r,]|| £ || jfA^]!!, and pivoting will be restricted to those blocks of A , resi- dent in main memory. This, of course, is precisely what is desired.

10. B. Storage Requirements

Data layout is an essential part of planning an algorithm' s imple- mentation on any parallel computer, since computational speed can be severely decreased if data cannot be accessed conveniently. On a vector computer this means that conceptual vectors must conform to the machine's definition of a vector operand. We consider some consequences of this requirement for implementation on the CDC STAR- 100, assuming that the prob- lem is core-contained.

Storage available on the STAR, for 64-bit words, consists of 256 reg- isters and either 512K or 1M words of main memory. A sizable secondary disc storage is also available but will not be considered here. A vector on STAR consists of s consecutive storage locations, 1 £ s £ 64K.

Our data arrangement for block tridiagonal linear systems with uniform nxn blocks is illustrated in fig. 10.1 for the case n = 2, N = 7. Define m =* rlog7N" , N* = 2m, and if N < N* then extend Ax = v with N* - N trivial

equations x^ =* 0. The matrix A and vector v can then be held in three

2

storage vectors of length n N* and one of length nN*. Taking the subdiag-

2

onal blocks as an example, the n components of a, are a. . , 1 £ i s n,

k 1 j , k

lsj5n, 1 s k £ N*, and the storage vector a is defined by

a((i - l)N*n + (j - 1)N* + k) =* a , . Thus a total of (3n^ + n)N* loca-

l j , K

tions are needed to store the original system. We do not consider the cost of setting up the storage vectors since each of our algorithms will

114

a

b

c

V

an,i = 0

bH,l

cu,i

vl,l

3 11 >2

bU,2

Cll,2

Vl,2

ail,3

bn,3

CH,3

Vl>3

*11.4

bll,4

Cll,4

Vl,4

aH,5

b.l,5

CH,5

Vl,5

all,6

bll,6

C 1 1 , 6

Vl,6

all,7

bH>7

Cll,7 = 0

Vl,7

a11.8 " 0

bll,8= 1

Cll,9 = °

Vl,8=°

a12,l = 0

bU,2

C 12 , 1

V2,l

al2,2

b12,2

C 12 ,2

Vo 9

al2 ,3

b12,3

C12,3

V9 3

al2,4

b 12 ,4

C12,4

V2,4

a!2,5

b12,5

C12,5

V2,5

a!2,6

b 12 , 6

C 12 , 6

V2,6

ai2,7

b12,7

C 12 , 7 ~ 0

V2,7

al2,S = 0

b12,8 = °

C 12 ,8 = °

a21,l =

0

b21,l

C21,l

a21,2

b21,2

C21,2

a21,3

b21,3

C21,3

a21,4

b21,4

C21,4

a21,5

b21,5

C21,5

a21,6

b21,6

C21,6

a21,7

b21,7

C21,7 =

0

a21,8

0

b21,8 = °

C21,8 =

0

a22,l =

0

b22,l

C22,l

a22,2

b22,2

C22,2

a22,3

b22,3

C22,3

a22,4

b22,4

C22,4

*22,5

b22,5

C22,5

a22,6

b22,6

C22,6

a22,7

b22,7

C22,7 =

0

a22.8 =

0

b22.9 = 1

C22,8 =

0

Fig. 10.1. Data layout for n=2, N = 7 , N* = 8 .

115

use the same arrangement and since the details can very widely between systems. This particular data layout has been chosen because it makes the odd-even reduction algorithm more efficient for the true STAR model, while the other competitive algorithms (LU factorization, odd-even elimination) are mostly unaffected by the choice of data layout, and can do without extra storage for trivial equations.

Since the STAR forces vector lengths to be £ 64K, we must have

?

n“N* ^ 64K = 65,536. For a given value of n, this restricts the system size as follows:

n

2

3 j

1 4

| 5

1 6

| 7

8

64R/n2

16384

7281

4096

| 2621

1826

1337

1024

max N*

16334

4096

1

4096

2048 1

1 1

1024

1

1024

1024

These will probably not be serious restrictions in most cases.

A more important restriction on system size is our assumption that the problem and solution will remain core-contained. Assuming a 512K main store, and supposing that temporary storage plus the program amounts to T times the original storage, we obtain the following restrictions on N:

n

2

3

4

5

6

7

| 8

512K T- 1

18724

8738

5041

3276

2279

1702

1310

(T 4- 1) (3n + n) T = 2

12483

5825

3360

2134

1533

1134

873

T = 3

9362

4369

2520

1638

1149

851

655

max N* T 1

16384

8192

4 096

2048

2048

1024

1024

T = 2

8192

4096

2048

2048

1024

1024

512

T = 3

3192

4096

2048 I .

1024

1024

512

512

116

1

The amount of temporary storage used depends on the algorithm and the extent to which A and v are overwritten. It can range from practically no

9

temporary storage (other than 0(n“) registers) for the LU factorization to T sa m for an inefficient implementation of odd-even elimination. Odd-even reduction would typically have 1^2.

We now consider the costs of data manipulation for the basic algorithms. Major considerations are the speed of the load store unit on STAR, for reg- ister-memory transfer of scalars, and the requirement that vectors be stored as contiguous memory locations.

The LU factorization, executed mostly in scalar mode: The only data manip- ulation is to load store between memory and the register file. To a con- siderable extent this may be overlapped with arithmetic operations. A complete analysis requires cycle counting in an assembly language program and timings from actual runs. Our estimate of execution time (based on an untested cycle count) is that the algorithm will probably be I/O bound.

With this in mind, we suggest four possible implementations: (Times given are for n = 2, and represent minimal run times assuming complete overlap between scalar arithmetic and load stores. Actual run time will certainly be greater. For details, see Appendices A, B.)

Version 1, a one-time solution, all scalar mode

time a 50SN - 362

Version 2, factor + solve, partial vector mode

factor, version a, time a 324N + 3421 (use for N > ~ 38) version b, time a 374N + 1769 version c, time a 420N - 230 (use for N < 33) solve, time a 303N + 1039

4

117

1

The three versions of factorization use increasingly fewer vector opera- tions and increasingly more scalar operations. If more than one system must be solved with the same matrix A, it is best to use Version 2. Our execution time comparison in Section 10. C will use Version 1.

Odd- even elimination, executed in vector mode: An important feature of this algorithm and any data layout like ours based on component vectors is that the conceptual vectors all correspond to the strictest require- ments for machine vectors, and hence no additional data manipulation is needed. (cf. [Ml] for the tridiagonal case.)

Odd- even reduction, executed in vector mode: If the vector computer model allows vectors to be storage locations in general arithmetic progression, then no additional data manipulation is needed. In a true model of the STAR, however, we need to perform two basic operations: (1) separate a vector into its even and odd components, and (2) rejoin the even and odd parts of a separated vector. The first is accomplished with two STAR compress instructions, the second with a STAR merge instruction. The odd- even separation is needed before each reduction step can be performed, and

must be applied to vectors a, b, £, and v. We illustrate the separation

2

for b in fig. 10.2. Having set up a control vector z consisting of n N*' 2 copies of the bits 10 (a single vector operation on STAR) , we execute

compress b -> bl(l) per z

b(2j - 1) - bl(j) , (lSjS |n2N*)

12

compress b -* M(— n N*) per z

b(2j) - bl(jn2N* + j), (1 * j s jn2U*).

2 112

The cost for each compress is n N* 4- -t-(— n N*) + 92 and the total cost to

o 2

118

b

z

bl

final state

1

b

h CD

11,1

11,1

11,1,,,

11,2

0

bn,3

b <X>

11,3,,,

11,3

1

bll,5

b

11,5..,

11,4

0

bll,7

b (1)

11,7 ,,,

b

1

b

b (1)

ij,k

11,5

12,1

12,1.,,

0

b

b CD i

^ k odd,

11 ,6

12,3

b12,3

1 S k S n*

b

b

12,7

12,3

21,1

321,2

321,3

>21,4

1

0

1

0

1

0

b

21,3

b

b

b

b

22,1

22,2

22.3

22.4

0

1

0

1

0

22,3

b

b (1)

22,5

b

°?2 s b (1>

22,7

22,7

b ,

b (2)

11,2

ll,l.o.

->

b

b (2)

11,4

H,3.o,

b

b (2)

b (2)

11,6

b

b12jl(2)

ij ,k

11,8

12,3

>

k odd,

1 £ k £ N* 2

b ,

b ' <2)

12,8

22,3

b

b (3)

21,2

11,1.0,

b

b <3)

21,4

11,2 .o.

b2 1,6

b (3)

12,1,0,

b <3)

b

. (3)

V

ij,k *

21,8

12,2

1 s k ^ n* 4

b

b ' (3)

22,8

22,2

J

Fig. 10.2. Odd-even separation for n=2,N=7, N*=8.

119

17 2

odd-even separate a, b, c, and v is (3n + n)N* + 736. Our initial

o

extension of A to N* block equations makes it possible to perform this

?

step using only 3 vector compress operations, independent of n“ . If A

2

had not been extended we would need either 2 (3n + n) vector operations

2

to compress the vectors one component subvector at a time, or n vector

2

operations to generate a modified control vector consisting of n copies of z(l:N), followed by the original 3 vector compress operations.

Lambiotte [LI] rejected use of odd-even reduction for n > 1 on this basis.

The net effect of our extended data layout is to reduce the number of vector operations needed to manipulate vectors for the STAR, and thus to reduce vector startup costs. On the other hand, we have increased vec- tor lengths and hence storage from N to N*, and this may be a more impor- tant consideration in some cases. The modified control vector approach

would then be most natural despite the increased startup costs. In any 3

case, 0(n ) vector operations are needed for the arithmetic part of the reduction step.

So far we have considered only the odd-even separation to prepare

for the first reduction. In general, the cost of the separation to pre-

17 2

pare for the ith reduction, 1 £ i £ m - 1, is (3n + n)N* + 736,

o L

ITt = 2m+^ and the total cost for m- 1 reductions is -^-(3n“ + n) (N* - 2)

+ 736(m-l). For the case n = 2 this yields 59. 5N* + 736m - 855. We will see shortly that this is roughly half as much time as is spent in the arithmetic operations, so the overhead of dealing with a restrictive machine vector definition can be considerable.

Data manipulation for the back substitution is simplified by saving the odd- indexed equations of A^ in appropriate form for vector operations (see fig. 10.2 for the final state of b) . Having computed x^i+^ = the

120

even components of x^^ , we compute the odd components of x^^ and merge the two sets into x^ . The merge operation on STAR is comparatively more expensive than the compress operation, but we are only dealing with the solution vector and not blocks of A. The cost of the merge operation at stage i is 3n.\\* + 123, and the total cost for m- 1 reductions is 6n(N'v - 2) + 123(m - 1) . For n = 2 this yields 12N'V + 123m - 147.

p-fold reduction: Suitable data arrangement and use of control vectors for p-fold reduction can readily be generalized from our discussion of odd-even reduction. The odd- even separation generalizes to separation into p vectors, which requires p compress instructions per storage vector;

^ 2 jjyL 1 » j

the cost per reduction would be (p + ) (3n“ + n)N.* + 363p, N.* - p‘ ,

O 1 1

m = log N 1 . The back substitution requires merging d vectors into 1,

1 1

using p-1 merge instructions at cost of 3 - lnN.* + 123 (p- 1) to go

, (i+1) (L) _ . , , , .....

trom x to x . It is clear that these costs are minimized bv

Jacobi iterations: Data manipulation is unnecessary when A and v or BrA' and D[A] ^v are stored according to our component subvector scheme.

10. C. Comparative Timing Analysis.

We close with an execution time comparison of algorithms for general block tridiagonal systems with 2x2 blocks, based on timing information for the CDC STAR- 100. This will expand the crude comparison of the LU factor- ization and odd-even reduction given in Section 4.D, and illustrates many earlier statements concerning the cost of various methods. Essential con- siderations include vector startup costs, creating a substantial penalty for operations on short vectors, and data manipulation to deal with

121

restrictive definitions of vector operands. Section 10. B began our dis- cussion of the latter problem.

Our initial simplified analysis is based on the following conven- tions, which allow us to consider only arithmetic costs:

1. we assume

a. vector operands may be any sequence of storage locations in arithmetic progression,

b. pivoting is never necessary;

2 . we ignore

a. instruction overlap capabilities,

b. address calculations, loop control and termination tests,

c. load/store costs for scalar operands,

d. storage management costs, as a consequence of assumption l.a.

Our second, and more realistic analysis will demand that vector operands consist of consecutive storage locations, so we must consider storage manage- ment as explained in Section 10. B. Overlap, loop control and scalar load/ store will also be considered, but not address calculation or termination of iterations. Algorithms and timing derivations are given in Appendix B; here we present the main results and make comparisons.

Execution times for the basic direct methods are initially estimated as follows (m * ’’log2N?):

1. the block LU factorization, using scalar operations only (LU) :

1012N - 7 84

2. odd-even elimination, using vector operations only (OEE) :

94 . 5Nm + 12999m - 93. 5N - 1901

r

122

3. odd-even reduction, using vector operations except for the final

block system (OER) : 106. 5N + 14839m - 14717.5

4. p-fold reduction, p 2 3:

d- 1

145N + 19206m (■

log2p

) - ( 1935 lp - 19579)

We can first dispose of p-fold reduction as a competitive method for gen- eral problems since either LU or OER is always cheaper. Execution time ratios show the benefits of using OER in preference to LU and OEE.

N = 1023

N = 8191

limit, N 00

OER:

LU

.23

.13

.11

OER:

OEE

.24

.11

0.

The importance of vector startups is seen by comparing OER and LU, for the latter is faster when N < 100. OER works by successively reducing vector lengths, and at some point this becomes inefficient. A polyalgorithm combining the two methods is straightforward: if m £ 6 then use LU, other- wise do m-5 reductions, solve the remaining system by LU, and back substi- tute. This reduces the time for OER to 106.5 N + 14839m - 46903.5, and for N 3 1023 it represents a savings of 13 i.

We also note that vector startups are the dominant term for OER when N £ 1532, and are the dominant term for OEE when N ^ 137. Of course, OEE uses 0(N log N) operations vs. O(N) for the other methods and soon becomes inefficient. Nevertheless, it is faster than LU for 455 £ N ^ 1023, although it is always slower than OER.

Our conclusion is that the OER - LU polyalgorithm is the best direct method of those considered here.

Now consider the cost of individual eliminations or reductions with

9

123

respect to the total cost of the algorithm. For odd-even elimination,

each elimination step has roughly equal cost, the last being slightly

cheaper. Thus each step costs about Ira of the total cost. For odd-even

reduction, the cost per reduction decreases as the algorithm proceeds: cost of the kth reduction as a fraction of total cost

1 k ° 1

2

3

4

5

6

total cost

N = 1023, m = 10

00

C^l

17$

12$

3$

7$

7$

242,622

N = 3191, m 3 13

43*

22$

12$

m

4$

3$

1,050,531

(cost for kth red. = 106.5(2m-k) + 14839 when N = 2m - 1)

For small N, when startup costs are dominant, reduction costs are approxi- mately equal, while for large N, when true arithmetic costs are dominant, the kth reduction costs about 12 of the whole. This has immediate con- sequences for the effectiveness of incomplete reduction, since the omitted reductions are only a small part of the total cost when N is large.

For our comparison of semidirect and iterative methods, the goal will be to approximate x with a relative error of l/N, or to reduce the initial error by a factor of l/N. Thus if we perform either k odd-even elimina- tions or k reductions, we want || B[H^+^]j| £ l/N or || B[A^k+^]|| £ I N, assuming jj B[A]j|< 1. Theorem 7.1 is then applied to produce the desired

approximation by incomplete elimination or reduction. Taking 9 = j (B [A ] 1 1,

k k

we have || B[H^+^]||, || B[A^k+1^]j| £ » and demanding 3“ ^ l/N yields

k s log2((log2N) / (- log2 9) ) .

The execution time for k eliminations is

94.5Nk + 12999k + 10. 5N - 105(2k) + 1205,

and the time for k reductions is

124

106. 5N + 14839k - 96(2m'k) + 1287.

Typical savings for incomplete reduction over complete reduction would be: 3 = .9, pick k £ log2log2N + 2.718

N = 1023,

k

= 7,

savings ~ 12)5

N = 8191,

k

= 7,

savings = 7^

2

8 = , pick k

2 log2log2N + .774

N = 1023,

k

= 5,

savings = 25^

N = 3191,

k

= 5,

savings = 12 £

0 3 Pick k

log.

log,N

N = 1023,

k

= 4,

savings = 33--

N = 3191,

k

3 4,

savings = 16^

Comparison to the OER - LU polyalgorithm decreases the savings.

We next consider iterative methods and their combination with semi- direct methods. One step of the block Jacobi iteration costs 22. 5N + 3031, while one step of the block Jacobi semi- iteration costs 26. 5N + 3408 plus overhead to estimate parameters such as p = o(B[A1). Execution time may be reduced to 12N + 1840 (16N 4- 2217) per iteration by directly computing B[A] and D[A] (cost = 33. 5N + 4367 = 3 iterations).

We will use the Jacobi iteration in its simplest form when only a few steps are demanded, and Jacobi semi- iteration when many steps are needed. The number of iterations needed by J-SI to reduce the initial error by a factor of 1/N are estimated to be

125

The LU factorization is always faster than OEE, and is faster than OER for N £ 312. Vector startup costs are dominant in OEE for N £ 152, and in OER for N £ 798. The OER-LU polyalgorithm should perform m-6 reduc- tions, solve the remaining system by LU, and back substitute. The time is 183N + 16233m - 70677, yielding a 16$ saving over OER for N = 1023 but only a 3$ saving for N = 3191.

The execution time for odd-even elimination is allocated as follows:

!

N = 1023 N = 3191

arithmetic

96$

96$

overhead

4 $

4$

results

37 $

98$

vector startups

13$

2$

Analysis of the total time for odd-even reduction shows the greater impor- tance of overhead operations:

I

N = 1023 arithmetic

overhead

N =* 8191 arithmetic

overhead

results

startups

33$

40$

73$

23$

/ d/

27$

56$

results

startups

52$

11$

62$

37$

1$

38$

38$

12$

126

limit, N -*

results

startups

/arithmetic

53 3

-

58%

(overhead

42*

-

423

1003

-

The increase in time for OER due to overhead is 37* for N = 1023, 6l3 for N = 8191, and 723 in the limit as N -* ®. As a consequence of the data layout of Section 10. B, overhead startup costs are quite small, but the overhead result costs cannot be ignored and probably cannot be decreased. The strong requirement that machine vectors consist of contiguous storage locations leads to much "wasted" effort in execution time.

Most of the execution time of OER is actually spent in the arithmetic and vector compress operations:

cost of individual operation as fraction of total cost

N = 1023

N = 8191

limit, N

divide

12*

12*

123

multiply

43*

35*

32*

add / subtract

18*

15*

14 3

compress

203

29*

33*

merge

43

6$

73

transmit

3*

3*

3*

The compress operation assumes more

reflecting its low startup costs.

importance as

N increases, again

As noted earlier,

most of OER'

s time is spent

in the first few reduc

tions and subsequent back substitutions.

127

p = .9, 1.5m iterations,

2

p = ^, .75m iterations,

p = -j, .5m iterations.

2

For p = , for example, J-SI is never faster than OER.

As suggested in Section 7. A, we can combine incomplete elimination or

reduction with Jacobi iterations. If k steps of elimination or reduction

are used, followed by l iterations and back substitution, k and l should ok l

be chosen so that 3 s L N, or

(*) k + logo ^ 2 log2((log9N) / (-log9 S) ) .

The time for the OER-J method would then be

106. 5N + 14839k - 96(2m-k) + 1287 + 2(19(2m_k) + 2615),

2

and it only remains to minimize the time subject to (*) . For 3 = ■j,

N = 1023, the resulting values are k = 2, i = 5, which represent a savings . 2

of 37/: over OER. For 3 = , N = 8191, the resulting values are k = 3, 2=3, which represent a savings of 15$ over OER.

Of course, to determine k and i we need to know 3. While it is true that 3 = j| B[A]j| can be computed directly from A at a cost of 46. 5N + 3881, or from B[A] at a cost of 15N + 308, we recognize that half the rows of B[A] are available as part of the first odd-even reduction. An estimate of || B[A]|| can be computed from these rows at a cost of 7.5N + 308. For large N even this cost considerably reduces any benefits gained from using incomplete reduction, and points out the importance of an analytic estimate of || B[A]|| rather than a computational estimate.

128

This ends our initial comparison of algorithms. We now make the assumption that vector operands are consecutive storage locations, allow overlap of scalar operations, and consider the cost of loop control and scalar load store operations. The net effect is to lower the time esti- mate for the LU factorization and raise the time estimate for odd-even reduction. Odd-even elimination remains substantially unchanged. Rather than simply recompute the figures presented earlier, which will be done in a few cases, we try to indicate where the algorithms spend most of their time.

Revised time estimates are given below. These should give a better indication of actual runtime on STAR, but are probably all underestimates; we note changes from the first set of timings, m = i log^N"! .

1. LU factorization, (LU) 600N - 150

(a 40 p decrease due to use of instruction overlap)

2. Odd-even elimination, (OEE) 98.5Nm 4- 13403m - 101. 5N - 2261 (a 4/» increase due to overhead costs)

3. Odd-even reduction. (OER) 183N + 16233m - 16108

(a 10-702 increase due to overhead costs; larger N incurs larger increases).

Time ratios show the effect of decreasing the LU time and increasing the OER time

N = 1023

N = 8191

limit, N

OER : LU

.54

.34

.31

OER:OEE

.32

.17

0.

129

cost of ith reduction as fraction of total cost

i =

1

2

3

4

5

6

N = 1023

33 4

19#

12 4

34

7 4

6 4

d =8191

45#

23#

12 4

6 i

2$

limit, N -* ®

5016

25 4

134

n

3 4

2 i

Of the total time, 86# is spent in the reduction phase and 14# in the back substitution .

One of the effects of the OER-LU polyalgorithm is to reduce the impor- tance of vector startup costs, as seen below.

N = 1023

= 3191

scalar time

14#

2 4

vector time

864

98 4

results

63*

91#

s tar tups

23 4

7 4

startup costs

are already

small not much

ilete reduction and incomplete

reduction +

results. (9

= II B[A] || ).

Incomplete reduction, savings

over OER

N =

1023

N = 3191

k

savings

k

savings

= .9 7

10#

7

5 4

f 5

i 4

CM

5

9 4

1

27#

4

134

130

Incomplete

reduction

+ Jacobi iteration.

saving over OER

N =

1023

N =

8191

k

J

savings k

l

sav in^s

3 •= .9

3

9

22 4

6

8 i

2

3

2

5

36^ 2

6

1315

_1

47 ^ 2

9

1

5

4

As seen earlier, the seraidirect methods are best applied when the omitted odd-even reductions make up a significant part of the OER time. There is considerable advantage to using incomplete reduction plus Jacobi iterations roughly doubling the savings obtained by simple incomplete reduction.

APPENDIX A

VECTOR COMPUTER INSTRUCTIONS

This section describes those features of the CDC STAR- 100 needed for our comparison of algorithms in Section 10. Information given here is derived from CDC reference manuals and timing measurements made at Lawrence Livermore Laboratory [C2], and by no means represents a complete discussion of the STAR'S facilities.

Execution times are for full-word (64 bit) floating point instructions, given as the number of 40 nanosecond cycles needed for each instruction.

All times should be accurate to within 5^ except as noted, though actual performance may be degraded somewhat due to memory conflicts.

A vector on STAR consists of N consecutive storage locations,

1 £ N £ 65,536. The vector length restriction is usually not serious, but the need for consecutive storage is crucial. Other vector computers, the TI ASC in particular, do not have this restriction. Both machines are pipeline computers with the usual scalar instructions as well as vector instructions .

Useful features of vector instructions on STAR include:

broadcast constants - Most vector instructions allow a scalar (the broad- cast constant, held in a register) to be transmitted repeatedly to form one or both of the vector operands. Use of this facility decreases exe- cution time in a few cases.

control vectors - Bit vectors can modify the effect of vector instructions, such as suppressing storage to selected components of the result vector or determining which operand components are to be used in an operation.

132

Use in vector data manipulation (compress and merge) is explained later.

offset - Modification to the base address of operands.

sign control - Certain vector operations allow the sign of the operands to be modified before execution.

Individual classes of instructions follow.

scalar (register) instructions. Issue time is the number of cycles required to initiate execution. Shortstop time is the time after issue when the result is first available, but not yet placed in a register. The result must be used at the precise time given; if not, the result cannot be used until it is placed in a register. Result-to-register time is the number of cycles after issue when the result is placed in a register. Overlapped

execution is possible, but the exact

effect

on execution

time is bes

determined by

actual testing, which we have

not done.

STAR

result

instruction

operation

issue

shortstop to

regis ter

62

add, normalized

2

6

11

66

subtract, normalized

2

6

11

6B

multiply, significant

2

10

15

6F

divide, significant

3

-

43

73

square root

3

-

71

78

register transmit

2

4

9

scalar (register) load/store Instructions. Up to three load/store instruc-

tions may be stacked in the load/store unit at any given time. The stack is processed sequentially at a rate of 19 cycles per load and 16 cycles

133

per store minimum, assuming no memory conflicts. Once a memory bank has been referenced it remains busy for 31 cycles; during this time further references to the bank are delayed and the conflict will affect references to other memory banks. Start to finish for a single load requires at least 31 cycles, and a minimum of 66 cycles are required to store an item and reload it.

STAR

instruction

operation

issue

result to register

load store unit busy

7E

load

3

28

19

7F

s tore

3

16

branch instructions . Execution time for a branch can range from 8 to 34 cycles, depending on the particular instruction, whether a branch is actu- ally made, whether the target instruction is in the instruction stack, etc.

We will be only slightly pessimistic if we take 40 cycles to be the time for a typical branch.

vector operation instructions. Execution time consists of an initial start- up time followed by sequential appearance of result vector components, or startup time followed by processing time of operands if a scalar result is generated. Vector instructions may not be overlapped except for a negligible part of the startup time. In the table below, N is the length of the input vector(s) and V is the number of omitted items due to use of a tor. Times for instructions D8 - DC may vary as much as +20^.

control vec-

134

STAR

vector

ins truction

operation

time

effect

82

add, normalized

+ 71

c . a . + b .

Ill

86

subtract, normalized

jN + 71

c . a . - b .

ill

8B

multiply, significant

N +159

c . a . x b .

ill

SF

divide, significant

2N +167

c . «- a . + b .

ill

93

square root

2N +155

c. - *jr.

i i

98

transmit

jN + 91

c . *- a . l i

99

absolute value

|n + 91

c . «- j a . ] l 1 l 1

D8

maximum

6N - 2V + 95

c «- max a ,

N 1

DA

summation

6.5N - 2V + 122

c - a .

i-1 1

DB

product

6N - V + 113

c - "N a.

i=l i

N

DC

inner product

6N - V + 130

c - a b

i= 1

vector data manipulation instructions. The compress instruction transmits selected components of one vector into a shorter result vector. Given a vector a and a control vector z, the instruction essentially executes the code

    j ← 1
    for i = 1, 2, ..., N
        if z_i = 1 then {c_j ← a_i; j ← j+1}.

The result vector c has length bitsum(z). The merge instruction "shuffles" two vectors into a third, according to the control vector:

    length(c) = length(z) = length(a) + length(b)
    j ← 1; k ← 1
    for i = 1, 2, ..., length(c)
        if z_i = 1 then {c_i ← a_j; j ← j+1} else {c_i ← b_k; k ← k+1}

In the table below, N_1 = length(a), N_2 = length(c), R = number of input items broadcast from register.

STAR
instruction    operation                   time

   BC          compress a → c per z        N_1 + N_2 + 92
   BD          merge a, b → c per z        3N_2 - 2R + 123

Execution times for these instructions may vary as much as ±20%.
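The following short Python sketch (ours, for illustration only) emulates the compress and merge semantics just described, with the control vector z given as a list of 0/1 values:

    def compress(a, z):
        """Keep a[i] wherever z[i] = 1; the result has length bitsum(z)."""
        return [ai for ai, zi in zip(a, z) if zi == 1]

    def merge(a, b, z):
        """Take the next element of a where z[i] = 1, otherwise the next of b."""
        c, j, k = [], 0, 0
        for zi in z:
            if zi == 1:
                c.append(a[j]); j += 1
            else:
                c.append(b[k]); k += 1
        return c

Note that merge(compress(x, z), compress(x, [1 - zi for zi in z]), z) reproduces x; odd-even reduction uses exactly this split-and-reassemble pattern on its block vectors.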


APPENDIX B

SUMMARY OF ALGORITHMS AND TIMINGS

We collect the algorithms discussed in the main text, and present timing formulae in general terms for a vector computer. This is then specialized to block tridiagonal systems with 2 × 2 blocks using timing information for the CDC STAR-100. A comparison of methods based on the latter set of timings is given in Section 10.C. It must be emphasized that our time estimates have not been verified by direct testing on STAR.

The algorithms solve Ax = v, A = (a_j, b_j, c_j)_N, with blocks of uniform size n × n. The basic arithmetic operations are (S = add or subtract, M = multiply, D = divide):

1. factor an n × n block b
       cost = ½(n² - n)D + ⅙(2n³ - 3n² + n)(M + S)

2. having factored b, solve bg = f, where f is n × n'
       cost = n'(nD + (n² - n)(M + S))

3. product of n × n and n × n' blocks
       cost = n'(n²M + (n² - n)S)

4. difference of two n × n' blocks
       cost = n'(nS)

Depending on context, the symbols S, M, D can mean either a single scalar operation or a vector operation with vector length specified by the algorithm.
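A small helper (ours, following the formulas above) makes the operation counts concrete; each function returns the number of divides, multiplies and add/subtracts for one of the four basic block operations.

    def factor_cost(n):
        """LU-factor an n x n block."""
        ops = (2*n**3 - 3*n*n + n) // 6
        return {"D": (n*n - n) // 2, "M": ops, "S": ops}

    def solve_cost(n, np_):
        """Solve b g = f for an n x np_ right-hand side, b already factored."""
        return {"D": np_ * n, "M": np_ * (n*n - n), "S": np_ * (n*n - n)}

    def product_cost(n, np_):
        """Multiply an n x n block by an n x np_ block."""
        return {"D": 0, "M": np_ * n*n, "S": np_ * (n*n - n)}

    def difference_cost(n, np_):
        """Subtract two n x np_ blocks."""
        return {"D": 0, "M": 0, "S": np_ * n}

For n = 2, factor_cost(2) gives one divide and one multiply/subtract pair, which is the count used in the 2 × 2 timings below.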

For each algorithm we give a code in terms of blocks of A, since the translation to individual scalar or vector operations is generally obvious.


In those cases where additional detail seems to be necessary it is provided. Two basic sets of timings are given, five formulae in all for each algorithm:

1. counting arithmetic operations only, assuming machine vectors can be any sequence of storage locations in arithmetic progression, and ignoring pivoting, instruction overlap, address calculation, loop control, termination tests for iterations, load/store costs for scalar operands, and any other forms of storage management

   (a) for systems with n × n blocks, in general terms for a vector computer

   (b) for systems with n × n blocks, with timing data for the CDC STAR-100

   (c) for systems with 2 × 2 blocks, with STAR data

2. a more realistic model of STAR, requiring machine vectors to be contiguous storage locations, allowing overlap of scalar operations, counting storage management and load/store costs, and loop control to some extent, but still ignoring pivoting, address calculations and termination tests for iterations

   (a) for systems with n × n blocks, with STAR data

   (b) for systems with 2 × 2 blocks, with STAR data.

Additional notes are provided in many cases. Timings 1(c) and 2(b) are used in Section 10.C.

A few remarks on the timings are necessary. Since we eventually want estimates for 2 × 2 blocks, the only loop control that will be counted is that on the main loops. Operations on n × n matrices or n-vectors are assumed to be expanded into straight-line code.

The address calculations that have been ignored here would be executed in scalar mode, and probably can be completely overlapped with scalar arithmetic operations in the LU factorization. Their effect on the odd-even methods would be to increase the O(m) terms, m = ⌈log₂N⌉, but this probably represents only a small relative increase since the O(m) terms consist of vector startup costs and are already large. While scalar operations can sometimes destroy the efficiency of a vector code, we believe that our basic conclusions will not be altered by including address calculations.


1. Block LU factorization, using scalar operations mostly

Algorithm:

    d_1 = b_1;  f_1 = v_1
    for i = 2, ..., N
        solve d_{i-1}(u_{i-1}  g_{i-1}) = (c_{i-1}  f_{i-1})
        d_i = b_i - a_i u_{i-1}
        f_i = v_i - a_i g_{i-1}
    solve d_N x_N = f_N
    for i = N-1, ..., 1
        x_i = g_i - u_i x_{i+1}
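A minimal NumPy sketch (ours, not the thesis code) of the block LU algorithm above may help fix the notation; a, b, c, v are 0-indexed Python lists of n × n arrays and n-vectors, with a[0] and c[N-1] unused.

    import numpy as np

    def block_lu_solve(a, b, c, v):
        N = len(b)
        u, g = [None] * N, [None] * N
        d, f = b[0].copy(), v[0].copy()
        for i in range(1, N):
            u[i-1] = np.linalg.solve(d, c[i-1])   # u_{i-1} = d_{i-1}^{-1} c_{i-1}
            g[i-1] = np.linalg.solve(d, f)        # g_{i-1} = d_{i-1}^{-1} f_{i-1}
            d = b[i] - a[i] @ u[i-1]              # d_i = b_i - a_i u_{i-1}
            f = v[i] - a[i] @ g[i-1]              # f_i = v_i - a_i g_{i-1}
        x = [None] * N
        x[N-1] = np.linalg.solve(d, f)            # d_N x_N = f_N
        for i in range(N-2, -1, -1):
            x[i] = g[i] - u[i] @ x[i+1]           # x_i = g_i - u_i x_{i+1}
        return x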

Storage handling:

If desired we can overwrite d_i → b_i, l_i = a_i d_{i-1}^{-1} → a_i, u_i → c_i, and x_i, g_i, f_i → v_i. The factorization loop needs to save either u_i and g_i or d_i and f_i, a total of n² + n stores and subsequent loads per step. If d_i, f_i are saved then the second loop would be

    for i = N-1, ..., 1
        solve d_i x_i = f_i - c_i x_{i+1},

with a corresponding increase in execution time.

Variations of the basic algorithm, with an estimate of the minimum runtime due to the load/store unit:

Version 1, a one-time solution.

    initialization            loads b_1, v_1
    factorization loop        loads c_{i-1}, a_i, b_i, v_i
                              stores u_i, g_i
    initialization            stores x_N
    back substitution loop    loads u_i, g_i
                              stores x_i

At 19 cycles per load and 16 per store,

    total time ≥ {19(4n² + 2n) + 16(n² + 2n)}(N-1) + 19(n² + n) + 16n
               = (92n² + 70n)(N-1) + 19n² + 35n
    for n = 2, total time ≥ 508N - 362

Version 2, factor + solve, A = (l_j, I, 0)(0, d_j, 0)(0, I, u_j).

Factorization requires n²(3N - 2) loads for A.

    Version a. The factorization loop stores d_i only, and computes l_j = a_j d_{j-1}^{-1}, u_j = d_j^{-1} c_j by vector operations, vector length = N.
        total time ≥ 19n²(3N - 2) + 16n²N + ½(n² - n)D + ⅙(14n³ - 15n² + n)(M + S),
        where D = τ_D N + σ_D, etc.
        for n = 2, total time ≥ 323.5N + 3421

    Version b. The factorization loop stores d_i, u_i, and computes l_j by vector operations, vector length = N.
        total time ≥ 19n²(3N - 2) + 16n²(2N - 1) + ½(2n² - n)D + ⅙(8n³ - 9n² + n)(M + S)
        for n = 2, total time ≥ 373.5N + 1769

    Version c. The factorization loop stores d_i, u_i, l_i.
        total time ≥ 19n²(3N - 2) + 16n²(3N - 2)
        for n = 2, total time ≥ 420N - 280

Solution of Ax = v, given the factorization of A.

    forward elimination       loads v_1 initially
                              loop loads l_i, v_i, stores f_i
    solve d_j g_j = f_j with vector operations, vector length = N
    back substitution         loads g_N initially
                              loop loads u_i, g_i, stores x_i

    total time ≥ 19(2n²(N-1) + 2nN) + 16(2nN) + ½(n² + n)D + ⅙(2n³ + 3n² - 5n)(M + S)
    for n = 2, total time ≥ 302.5N + 1039

Timings:

1. Version 1, for initial comparison, arithmetic operations without overlap, load/stores ignored.

   (a) {½(3n² + n)t_D + ⅙(14n³ + 9n² - 5n)(t_M + t_S)}N - {n²t_D + (2n³ + n²)(t_M + t_S)}

   (b) with STAR data: (70n³ + 114n² - 2n)N - (60n³ + 76n²)

   (c) for n = 2, with STAR data: 1012N - 784

2. Version 1, for second comparison, allowing overlap and including load/stores.

   In the absence of facilities for direct testing on STAR, a scalar code was written for the special case n = 2 and time estimates computed from that code. The details are tedious and well illustrated by the simpler tridiagonal code in [L2], so they will not be given here. The main time estimates are

       initialize - loads + loop set up       200 cycles
       factorization loop                     400(N-1)
       initialize                             250
       back substitution loop                 200(N-1)
       total time estimated to be             600N - 150 cycles

   This timing is probably optimistic due to unforeseen memory and overlap conflicts.


Notes:

1. Gaussian elimination, taking advantage of the block structure, using scalar operations, requires the same execution time as the block LU factorization.

2. If u_j, 1 ≤ j ≤ N-1, is saved in the factorization, then ||B[A]|| may be computed following the first loop at a cost of ((n-1)τ_S + τ_max)(nN) + (n-1)σ_S + σ_max, using the vector maximum instruction. This yields 13N + 166 for n = 2 on STAR.


2. Odd-even elimination, using vector operations

We give the algorithm and timing for N = 2^m - 1, deriving formulae solely in terms of N and m. Timing for other values of N is approximated by setting m = ⌈log₂N⌉.

Algorithm: a_j^(1) = a_j, etc.

    for i = 1, 2, ..., m    (r = 2^(i-1))                                  vector lengths

        for each j = 1, 2, ..., N
            solve b_j^(i) (α_j  γ_j  φ_j) = (a_j^(i)  c_j^(i)  v_j^(i))        N, N-r
                take α_j = γ_j = 0, φ_j = 0 if j < 1 or j > N
            b_j^(i+1) = b_j^(i) - a_j^(i) γ_{j-r} - c_j^(i) α_{j+r}            N-r
            v_j^(i+1) = v_j^(i) - a_j^(i) φ_{j-r} - c_j^(i) φ_{j+r}            N-r
            a_j^(i+1) = -a_j^(i) α_{j-r}                                       N-2r
            c_j^(i+1) = -c_j^(i) γ_{j+r}                                       N-2r

    for each j = 1, 2, ..., N
        solve b_j^(m+1) x_j = v_j^(m+1)

Storage handling: temporary vectors are needed for the factorization of b_j^(i) and for α_j, γ_j, φ_j computed at each stage, and we can overwrite b_j^(i+1) → b_j^(i), a_j^(i+1) → a_j^(i), etc.
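For illustration, here is a NumPy sketch (ours, not the STAR implementation) of one way to carry out the sweep above; blocks are 0-indexed and a[0], c[N-1] are assumed to be zero blocks, with v given as a list of n-vectors.

    import numpy as np

    def odd_even_elimination(a, b, c, v):
        N, n = len(b), b[0].shape[0]
        Z, z = np.zeros((n, n)), np.zeros(n)
        m = 0
        while (1 << m) < N + 1:                    # m = ceil(log2(N+1)) stages
            m += 1
        r = 1
        for _ in range(m):
            alpha = [np.linalg.solve(b[j], a[j]) for j in range(N)]
            gamma = [np.linalg.solve(b[j], c[j]) for j in range(N)]
            phi   = [np.linalg.solve(b[j], v[j]) for j in range(N)]
            A = lambda j: alpha[j] if 0 <= j < N else Z    # zero outside 1..N
            G = lambda j: gamma[j] if 0 <= j < N else Z
            P = lambda j: phi[j]   if 0 <= j < N else z
            b = [b[j] - a[j] @ G(j - r) - c[j] @ A(j + r) for j in range(N)]
            v = [v[j] - a[j] @ P(j - r) - c[j] @ P(j + r) for j in range(N)]
            a = [-(a[j] @ A(j - r)) for j in range(N)]
            c = [-(c[j] @ G(j + r)) for j in range(N)]
            r *= 2
        return [np.linalg.solve(b[j], v[j]) for j in range(N)]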

Timing:

1. arithmetic only

   (a) {½(5n² + n)τ_D + ⅙(38n³ + 3n² - 5n)τ_M + ⅙(38n³ - 9n² - 5n)τ_S} Nm
       + {½(5n² + n)σ_D + ⅙(38n³ + 3n² - 5n)σ_M + ⅙(38n³ - 9n² - 5n)σ_S} m
       - {½(3n² - n)τ_D + ⅙(46n³ - 3n² + 5n)τ_M + ⅙(46n³ - 27n² + 5n)τ_S} N
       + {2n³τ_M + (2n³ - 2n²)τ_S + ½(n² + n)σ_D + ⅙(-10n³ + 3n² - 5n)σ_M + ⅙(-10n³ + 15n² - 5n)σ_S}

   (b) with STAR data,
       ¼(38n³ + 19n² - n)Nm + ⅙(8740n³ + 2343n² - 649n)m - ¼(46n³ + n² + n)N - ⅙(2282n³ - 2037n² + 649n)

   (c) for n = 2, with STAR data,
       94.5Nm + 12999m - 93.5N - 1901.

2. With overwriting of a_j^(i+1) → a_j^(i), c_j^(i+1) → c_j^(i), it is necessary to develop the products one row at a time in a temporary vector, and then use a transmit operation on the whole row. For specifics see the discussion of odd-even reduction to follow. The added cost for loop control and transmit instructions is

       40m + Σ_{i=1}^{m-1} 2n(½n(N - 2^i) + 91) = 40m + n²(N(m-2) + 1) + 182n(m-1),

   or 4Nm + 404m - 8N - 360 for n = 2.

   (a) with STAR data, total time is
       ¼(38n³ + 23n² - n)Nm + ⅙(8740n³ + 2343n² + 443n + 240)m - ¼(46n³ + 9n² + n)N - ⅙(2282n³ - 2043n² + 1741n)

   (b) for n = 2, with STAR data, total time is
       98.5Nm + 13403m - 101.5N - 2261.


Notes:

1. ||B[A]|| may be computed in the first time through the loop at a cost of ((2n-1)τ_S + τ_max)(nN) + ((2n-1)σ_S + σ_max). This yields 15N + 308 for n = 2 on STAR. ||B[H^(i)]|| may be computed at a cost of 15(N - 2^(m-1)) + 308 at the appropriate time through the loop.


3. Odd-even reduction, using vector operations

As with odd-even elimination, the algorithm and timings are derived for N = 2^m - 1. For general N take m = ⌈log₂N⌉ in the timing formulae, even though m = ⌈log₂(N+1)⌉ is the proper choice. The algorithm is given in detailed form, using the data layout and storage handling described in Section 10.B. For the basic form, see Section 4.D.

Algorithm: A^(1) = A, w^(1) = v, N_1* = 2^m

set up control vector z = (1 0 1 0 ...), length n²N_1* bits
    cost not counted in timing, but = one instruction on STAR

for i = 1, 2, ..., m-1    (N_{i+1} = 2^(m-i) - 1, N_i* = 2^(m+1-i))

    odd-even separate A^(i), w^(i)
        8 compress operations, cost = (3n² + n)N_i* + 736

    factor b_{2j+1}^(i),  (0 ≤ j ≤ N_{i+1})
        factorization overwrites b_{2j+1}^(i); need a temporary vector
        cost = ½(n² - n)D + ⅙(2n³ - 3n² + n)(M + S),  vector lengths = N_{i+1} + 1

    solve b_{2j+1}^(i) (α_{2j+1}^(i)  γ_{2j+1}^(i)  φ_{2j+1}^(i)) = (a_{2j+1}^(i)  c_{2j+1}^(i)  w_{2j+1}^(i)),  (0 ≤ j ≤ N_{i+1})
        overwrites a_{2j+1}^(i), c_{2j+1}^(i), etc.; need a temporary vector
        cost = (2n + 1)(nD + (n² - n)(M + S)),  vector lengths = N_{i+1} + 1

    b_j^(i+1) = b_{2j}^(i) - a_{2j}^(i) γ_{2j-1}^(i) - c_{2j}^(i) α_{2j+1}^(i),  (1 ≤ j ≤ N_{i+1})
        overwrites b_{2j}^(i); need a temporary vector
        cost = 2n³(M + S); for simplicity, take vector lengths = N_{i+1}*

    w_j^(i+1) = w_{2j}^(i) - a_{2j}^(i) φ_{2j-1}^(i) - c_{2j}^(i) φ_{2j+1}^(i)
        overwrites w_{2j}^(i); need a temporary vector
        cost = 2n²(M + S); for simplicity, take vector lengths = N_{i+1}*

    a_j^(i+1) = -a_{2j}^(i) α_{2j-1}^(i),  (2 ≤ j ≤ N_{i+1})
        overwrites a_{2j}^(i); the product is developed one row at a time:
            for k = 1, ..., n
                for l = 1, ..., n
                    t_{lj} = -a_{k1,2j}^(i) α_{1l,2j-1}^(i)
                    for r = 2, ..., n
                        t_{lj} = t_{lj} - a_{kr,2j}^(i) α_{rl,2j-1}^(i),  (1 ≤ j ≤ N_{i+1})
                a_{kl,2j}^(i) ← t_{lj},  (1 ≤ l ≤ n; 1 ≤ j ≤ N_{i+1})
        cost = n³M + (n³ - n²)S, vector length = nN_{i+1}*,
            plus n transmits, vector length = nN_{i+1}*
        need two temporary vectors, lengths nN_{i+1} and N_{i+1}

    c_j^(i+1) = -c_{2j}^(i) γ_{2j+1}^(i),  (1 ≤ j ≤ N_{i+1} - 1)
        as for a_j^(i+1)

solve b_1^(m) x_1^(m) = w_1^(m),  scalar mode
    cost = ½(n² + n)D + ⅙(2n³ + 3n² - 5n)(M + S)

for i = m-1, ..., 1

    x_{2j+1}^(i) = φ_{2j+1}^(i) - α_{2j+1}^(i) x_j^(i+1) - γ_{2j+1}^(i) x_{j+1}^(i+1),  (0 ≤ j ≤ N_{i+1}; take x_0^(i+1) = x_{N_{i+1}+1}^(i+1) = 0)
        overwrites φ_{2j+1}^(i)
        cost = 2n²(M + S),  vector lengths = N_{i+1}*

    x_j^(i+1) → temporary
        cost = transmit, vector length = nN_{i+1}*

    merge x_{2j+1}^(i), temporary → x^(i)
        cost = 3nN_i* + 123
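The detailed vector code above can be compressed into a short recursive NumPy sketch (ours, same arithmetic but without the compress/merge data layout); blocks are 0-indexed, a[0] and c[N-1] are zero blocks, and N = 2^m - 1 is assumed so that every reduced system is again of that form.

    import numpy as np

    def odd_even_reduction(a, b, c, w):
        N = len(b)
        if N == 1:
            return [np.linalg.solve(b[0], w[0])]
        # multipliers for the rows to be eliminated (even 0-based = odd 1-based rows)
        alpha = {j: np.linalg.solve(b[j], a[j]) for j in range(0, N, 2)}
        gamma = {j: np.linalg.solve(b[j], c[j]) for j in range(0, N, 2)}
        phi   = {j: np.linalg.solve(b[j], w[j]) for j in range(0, N, 2)}
        keep = list(range(1, N, 2))               # rows retained in the reduced system
        a2 = [-(a[k] @ alpha[k - 1]) for k in keep]
        c2 = [-(c[k] @ gamma[k + 1]) for k in keep]
        b2 = [b[k] - a[k] @ gamma[k - 1] - c[k] @ alpha[k + 1] for k in keep]
        w2 = [w[k] - a[k] @ phi[k - 1]   - c[k] @ phi[k + 1]   for k in keep]
        y = odd_even_reduction(a2, b2, c2, w2)    # half-size system of the same form
        x = [None] * N
        for i, k in enumerate(keep):
            x[k] = y[i]
        zero = np.zeros_like(w[0])
        for j in range(0, N, 2):                  # back substitute for eliminated rows
            xl = x[j - 1] if j - 1 >= 0 else zero
            xr = x[j + 1] if j + 1 < N else zero
            x[j] = phi[j] - alpha[j] @ xl - gamma[j] @ xr
        return x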

Timings:

1. arithmetic only,

   (a) {½(5n² + n)τ_D + ⅙(38n³ + 15n² - 5n)τ_M + ⅙(38n³ + 3n² - 5n)τ_S}(N-1)
       + {½(5n² + n)σ_D + ⅙(38n³ + 15n² - 5n)σ_M + ⅙(38n³ + 3n² - 5n)σ_S}(m-1)
       + {½(n² + n)t_D + ⅙(2n³ + 3n² - 5n)(t_M + t_S)}

   (b) with STAR data:
       ¼(38n³ + 31n² - n)(N-1) + ⅙(8740n³ + 5103n² - 649n)(m-1) + (10n³ + 38n² - 2n)

   (c) for n = 2, with STAR data: 106.5N + 14839m - 14717.5

2. with loop control and storage manipulation (compress, merge, transmit) using STAR data,

   (a) for general n, add 80m + ¼(55n² + 43n)(N-1) + (182n + 950)(m-1)

   (b) for n = 2, add 76.5N + 1394m - 1390.5, so total time = 183N + 16233m - 16108


Notes:

1. An estimate for ||B[A]|| may be computed in the first time through the first loop at a cost of ((2n-1)τ_S + τ_max)(Nn/2) + ((2n-1)σ_S + σ_max). This yields 7.5N + 308 for n = 2 on STAR. The estimate of ||B[A]|| is simply the maximum row sum taken over every other block row. The true value of ||B[A]|| could be found by computing the remaining rows of B[A], but this is more expensive and defeats the purpose of odd-even reduction.


4. p-fold reduction

The algorithm and timings are given for N = p^m - 1, p ≥ 3. For general N, take m = ⌈log_p N⌉ = m̄/log₂p, m̄ = ⌈log₂N⌉; m = ⌈log_p(N+1)⌉ is actually the proper choice.

Algorithm: (see Section 6.A for details), N_i = p^(m+1-i) - 1

    for i = 1, ..., m-1
        for each k = 1, ..., N_{i+1} + 1,
            LU process on the k-th group of p-1 consecutive block rows of A^(i)
        for each k = 1, ..., N_{i+1},
            UL process on the k-th group
        generate A^(i+1), w^(i+1)
    solve b_1^(m) x_1^(m) = w_1^(m)
    for i = m-1, ..., 1
        for each k = 1, ..., N_{i+1} + 1
            back substitute for x_{kp-j}^(i), 1 ≤ j ≤ p-1

Timing:

1. (a) for general n,
       {(5n² + n)τ_D + ⅓(26n³ + 3n² - 5n)τ_M + ⅓(26n³ - 3n² - 5n)τ_S}(N - p + 1)
       + {(5n² + n)σ_D + ⅓(26n³ + 3n² - 5n)σ_M + ⅓(26n³ - 3n² - 5n)σ_S}(m-1)(p-1)
       + {½(n² + n)t_D + ⅙(2n³ + 3n² - 5n)(t_M + t_S)}

   (b) with STAR data,
       ½(26n³ + 21n² - n)(N - p + 1) + ⅓(5980n³ + 2769n² - 649n)(m-1)(p-1) + (10n³ + 38n² - 2n)

   (c) using STAR data, n = 2:
       145N + (19206m - 19351)(p-1) + 228 = 145N + 19206m(p-1) - (19351p - 19579)

2. Data manipulation costs are outlined in Section 10.B. We have not derived the complete timing formulae, as the method is expected to be too slow.

Notes:

1. In the arithmetic timing, for p ≥ 3, p-fold reduction is always cheaper than (p+1)-fold reduction for any n. Comparison of 2-fold (odd-even) and 3-fold reduction yields an N(n) such that if N ≥ N(n) then odd-even reduction is faster. Although we have not determined N(n), the block LU factorization will probably be faster than either reduction method for N < N(n). Similar results should hold when overhead costs are included in the time estimates.


5. Mixed odd-even reduction and LU factorization

Algorithm: do k odd-even reductions, solve A^(k+1) x^(k+1) = w^(k+1) by the LU factorization, and back substitute.

Timing: (based on N = 2^m - 1)

1. (a) {½(5n² + n)τ_D + ⅙(38n³ + 15n² - 5n)τ_M + ⅙(38n³ + 3n² - 5n)τ_S}(N + 1 - 2^(m-k))
       + {½(5n² + n)σ_D + ⅙(38n³ + 15n² - 5n)σ_M + ⅙(38n³ + 3n² - 5n)σ_S}k
       + {½(3n² + n)t_D + ⅙(14n³ + 9n² - 5n)(t_M + t_S)}(2^(m-k) - 1)
       - {n²t_D + (2n³ + n²)(t_M + t_S)}

   The timing difference between odd-even reduction and the mixed algorithm will be an expression of the form γ_n(m-k) - β_n 2^(m-k) - ν_n, and k should be chosen to maximize this expression. For n = 2 using STAR data, we have γ_2 = 14839, β_2 = 905.5, ν_2 = 13023, k = m-5, yielding

   (c) 106.5N + 14839m - 46908.5

2. For n = 2 and STAR data, the polyalgorithm time is 183N + 16233k + 417(2^(m-k)) + 33, and the optimal choice is k = m-6, yielding

   (b) 183N + 16233m - 70677
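The optimal cutover point can also be found by direct search; a small sketch (ours, using the n = 2 STAR constants of timing 2 quoted just above):

    def mixed_time(N, m, k):
        """Timing 2 estimate (n = 2, STAR data) for k reductions followed by LU."""
        return 183 * N + 16233 * k + 417 * (2 ** (m - k)) + 33

    def best_k(N, m):
        return min(range(m + 1), key=lambda k: mixed_time(N, m, k))

For m = 10 (N = 1023) the search returns k = 4 = m - 6, in agreement with the choice quoted above.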


6. Incomplete reduction

Algorithm: do k odd-even reductions, solve D[A^(k+1)] y^(k+1) = w^(k+1), and back substitute to obtain y = y^(1) ≈ x.

Timing: (based on N = 2^m - 1)

1. (a) {½(5n² + n)τ_D + ⅙(38n³ + 15n² - 5n)τ_M + ⅙(38n³ + 3n² - 5n)τ_S}(N + 1 - 2^(m-k))
       + {½(5n² + n)σ_D + ⅙(38n³ + 15n² - 5n)σ_M + ⅙(38n³ + 3n² - 5n)σ_S}k
       + {½(n² + n)τ_D + ⅙(2n³ + 3n² - 5n)(τ_M + τ_S)}(2^(m-k) - 1)
       + {½(n² + n)σ_D + ⅙(2n³ + 3n² - 5n)(σ_M + σ_S)}

   (b) general n, STAR data:
       ¼(38n³ + 31n² - n)N + ⅙(8740n³ + 5103n² - 649n)k - (9n³ + 6n²)(2^(m-k) - 1) + ⅙(460n³ + 1191n² - 649n)

   (c) using STAR data, n = 2: 106.5N + 14839k - 96(2^(m-k)) + 1287

2. (a) general n, STAR data:
       ¼(38n³ + 86n² + 42n)N + ⅙(8740n³ + 5103n² + 443n + 5700)k - ¼(36n³ + 79n² + 51n)(2^(m-k) - 1) + ⅙(460n³ + 1191n² - 1651n) + 80(k+1)

   (b) n = 2, STAR data:
       183N + 16233k - 176.5(2^(m-k)) + 1113.5

7. Jacobi iteration, semi-iteration

basic step:  y^(i+1) = B[A] y^(i) + D[A]^(-1) v

    for each j = 1, ..., N
        solve b_j y_j^(i+1) = v_j - a_j y_{j-1}^(i) - c_j y_{j+1}^(i)

Timing (identical for both vector computer models):

1. (a) ½(n² + n)(τ_D N + σ_D) + ⅙(2n³ + 15n² - 5n)((τ_M + τ_S)N + (σ_M + σ_S))

1. (b) = 2. (a)  using STAR data:  ¼(2n³ + 19n² - n)N + ⅙(460n³ + 3951n² - 649n)

1. (c) = 2. (b)  using STAR data, n = 2:  22.5N + 3031

Optional preprocessing, times for n = 2, STAR data:

(1) solve b_j(α_j  γ_j  φ_j) = (a_j  c_j  v_j), cost = 38.5N + 4367. The basic step is then

        y_j^(i+1) = φ_j - α_j y_{j-1}^(i) - γ_j y_{j+1}^(i),

    cost = 12N + 1840.

(2) factor b_j, cost = 3.5N + 397. The basic step is unchanged from the original, but its cost is 19N + 2634. This is used in the mixed incomplete reduction / Jacobi algorithm.

Semi-iteration, basic step:

    z^(m) = result of Jacobi step from y^(m)
    y^(m+1) = ω_{m+1}(z^(m) - y^(m-1)) + y^(m-1),   ω_{m+1} = 1/(1 - ¼ρ²ω_m),   ρ = ρ(B[A]).

Timing, n = 2, using STAR data: 26.5N + 3408, or 16N + 2217 with the first preprocessing.
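A companion sketch (ours) of the acceleration wrapper, assuming the jacobi_step function from the previous sketch and a known ρ = ρ(B[A]); the starting values ω_1 = 1 and ω_2 = 2/(2 - ρ²) are the standard ones for Chebyshev semi-iteration.

    def semi_iteration(a, b, c, v, y0, rho, iters):
        y_prev, y = y0, y0
        omega = 1.0
        for m in range(iters):
            z = jacobi_step(a, b, c, v, y)               # Jacobi step from y^(m)
            if m == 0:
                omega = 1.0
            elif m == 1:
                omega = 1.0 / (1.0 - 0.5 * rho * rho)
            else:
                omega = 1.0 / (1.0 - 0.25 * rho * rho * omega)
            y_prev, y = y, [omega * (zj - pj) + pj for zj, pj in zip(z, y_prev)]
        return y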

Notes:

1. Varga [V3, p. 139] analyzes the convergence of Jacobi semi-iteration. A conservative estimate for the number of iterations needed to obtain ||e^(η)|| ≤ (1/N)||e^(0)|| is η = 2 log N (-log(ω_∞ - 1))^(-1), where ω_∞ is the limit of the acceleration parameters ω_m. It follows that the execution time for an iterative solution is O(N log N) vs. O(N) for the better direct methods.

8. Incomplete reduction plus Jacobi iterations

Algorithm: do k odd-even reductions, solve D[A^(k+1)] y^(k+1) = w^(k+1), do ℓ Jacobi iterations to improve y^(k+1), and back substitute.

As a result of solving D[A^(k+1)] y^(k+1) = w^(k+1), we have factors for D[A^(k+1)]. This means the cost is (using STAR data, n = 2)

1. (c)  106.5N + 14839k - 96(2^(m-k)) + 1287 + ℓ(19(2^(m-k)) + 2615)

2. (b)  183N + 16233k - 176.5(2^(m-k)) + 1113.5 + ℓ(19(2^(m-k)) + 2615)


REFERENCES

[B1] I. Babuska, "Numerical stability in mathematical analysis," IFIP Congress 1968, North-Holland, Amsterdam, pp. 11-23.

[B2] I. Babuska, Numerical stability in the solution of tridiagonal matrices, Rept. BN-609, Inst. for Fluid Dynamics and Appl. Math., Univ. of Md., 1969.

[B3] I. Babuska, "Numerical stability in problems of linear algebra," SIAM J. Numer. Anal., vol. 9, 1972, pp. 53-77.

[B4] D. W. Bailey and D. E. Crabtree, "Bounds for determinants," Lin. Alg. Appl., vol. 2, 1969, pp. 303-309.

[B5] R. E. Bank, "Marching algorithms and block Gaussian elimination," in [B16, pp. 293-307].

[B6] G. H. Barnes, R. M. Brown, M. Kato, D. J. Kuck, D. L. Slotnick, R. A. Stoker, "The Illiac IV computer," IEEE Trans. on Comp., vol. C-17, 1968, pp. 746-757.

[B7] G. M. Baudet, Asynchronous iterative methods for multiprocessors, Dept. of Computer Science, Carnegie-Mellon Univ., Nov. 1976.

[B8] F. L. Bauer, "Der Newton-Prozess als quadratisch konvergente Abkürzung des allgemeinen linearen stationären Iterationsverfahrens 1. Ordnung (Wittmeyer-Prozess)," Z. Angew. Math. Mech., vol. 35, 1955, pp. 469-470.

[B9] A. Benson and D. J. Evans, "The successive peripheral block overrelaxation method," J. Inst. Maths. Applics., vol. 9, 1972, pp. 68-79.

[B10] A. Benson and D. J. Evans, "Successive peripheral overrelaxation and other block methods," J. Comp. Phys., vol. 21, 1976, pp. 1-19.

[B11] G. Birkhoff and A. George, "Elimination by nested dissection," in [T1, pp. 221-269].

[B12] W. J. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, D. L. Slotnick, "The Illiac IV system," Proc. IEEE, vol. 60, 1972, pp. 369-388.

[B13] A. Brandt, Multi-level adaptive techniques I. The multi-grid method, Rept. RC 6026, IBM T. J. Watson Res. Cen., Yorktown Heights, N.Y., 1976.

[B14] J. L. Brenner, "A bound for a determinant with dominant main diagonal," Proc. AMS, vol. 5, 1954, pp. 631-634.

[B15] C. G. Broyden, "Some condition number bounds for the Gaussian elimination process," J. Inst. Maths. Applics., vol. 12, 1973, pp. 273-286.

[B16] J. R. Bunch and D. J. Rose, eds., Sparse Matrix Computations, Academic Press, N.Y., 1976.

[B17] B. L. Buzbee, Application of fast Poisson solvers to the numerical approximation of parabolic problems, Rept. LA-4950-T, Los Alamos Sci. Lab., 1972.

[B18] B. L. Buzbee and A. Carasso, "On the numerical computation of parabolic problems for preceding times," Math. Comp., vol. 27, 1973, pp. 237-266.

[B19] B. L. Buzbee, F. W. Dorr, J. A. George and G. H. Golub, "The direct solution of the discrete Poisson equation on irregular regions," SIAM J. Numer. Anal., vol. 8, 1971, pp. 722-736.

[B20] B. L. Buzbee, G. H. Golub and C. W. Nielson, "On direct methods for solving Poisson's equations," SIAM J. Numer. Anal., vol. 7, 1970, pp. 627-656.

[C1] A. Carasso, "The abstract backward beam equation," SIAM J. Math. Anal., vol. 2, 1971, pp. 193-212.

[C2] Control Data Corporation, CDC STAR-100 Instruction execution times, Arden Hills, Minn., Jan. 1976; in conjunction with measurements made at Lawrence Livermore Laboratory.

[C3] A. K. Cline, personal communication, 1974.

[C4] P. Concus and G. H. Golub, "Use of fast direct methods for the efficient solution of nonseparable elliptic equations," SIAM J. Numer. Anal., vol. 10, 1973, pp. 1103-1120.

[C5] P. Concus, G. H. Golub and D. P. O'Leary, "A generalized conjugate gradient method for the numerical solution of elliptic partial differential equations," in [B16, pp. 309-332].

[C6] S. D. Conte and C. de Boor, Elementary Numerical Analysis, McGraw-Hill, N.Y., 1972.

[C7] M. G. Cox, "Curve fitting with piecewise polynomials," J. Inst. Maths. Applics., vol. 8, 1971, pp. 36-52.

[C8] Cray Research, Inc., "Cray-1 Computer," Chippewa Falls, Wis., 1975.

[D1] M. A. Diamond and D. L. V. Ferreira, "On a cyclic reduction method for the solution of Poisson's equations," SIAM J. Numer. Anal., vol. 13, 1976, pp. 54-70.

[D2] F. W. Dorr, "The direct solution of the discrete Poisson equation on a rectangle," SIAM Review, vol. 12, 1970, pp. 248-263.

[D3] F. W. Dorr, "An example of ill-conditioning in the numerical solution of singular perturbation problems," Math. Comp., vol. 25, 1971, pp. 271-283.

[E1] S. C. Eisenstat, M. H. Schultz and A. H. Sherman, "Applications of an element model for Gaussian elimination," in [B16, pp. 85-96].

[E2] D. J. Evans, "A new iterative method for the solution of sparse systems of linear difference equations," in [R6, pp. 89-100].

[E3] R. K. Even and Y. Wallach, "On the direct solution of Dirichlet's problem in two dimensions," Computing, vol. 5, 1970, pp. 45-56.

[F1] D. C. Feingold and R. S. Varga, "Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem," Pacific J. Math., vol. 12, 1962, pp. 1241-1250.

[F2] D. Fischer, G. Golub, O. Hald, J. Leiva and O. Widlund, "On Fourier-Toeplitz methods for separable elliptic problems," Math. Comp., vol. 28, 1974, pp. 349-368.

[F3] G. von Fuchs, J. R. Roy and E. Schrem, "Hypermatrix solution of large sets of symmetric positive definite linear equations," Comp. Math. in Appl. Mech. and Engng., vol. 1, 1972, pp. 197-216.

[G1] J. A. George, "Nested dissection of a regular finite element mesh," SIAM J. Numer. Anal., vol. 10, 1973, pp. 345-363.

[H1] L. A. Hageman and R. S. Varga, "Block iterative methods for cyclically reduced matrix equations," Numer. Math., vol. 6, 1964, pp. 106-119.

[H2] D. E. Heller, "Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems," SIAM J. Numer. Anal., vol. 13, 1976, pp. 484-496.

[H3] D. E. Heller, A survey of parallel algorithms in numerical linear algebra, Dept. of Comp. Sci., Carnegie-Mellon Univ., 1976; SIAM Review, to appear.

[H4] D. E. Heller, D. K. Stevenson and J. F. Traub, "Accelerated iterative methods for the solution of tridiagonal systems on parallel computers," J.ACM, vol. 23, 1976, pp. 636-654.

[H5] R. W. Hockney, "A fast direct solution of Poisson's equation using Fourier analysis," J.ACM, vol. 12, 1965, pp. 95-113.

[H6] R. W. Hockney, "The potential calculation and some applications," Methods in Computational Physics, vol. 9, 1970, pp. 135-211.

[H7] E. Horowitz, "The application of symbolic mathematics to a singular perturbation problem," Proc. ACM Annual Conf., 1972, vol. II, pp. 816-825.

[H8] A. S. Householder, The Theory of Matrices in Numerical Analysis, Blaisdell, N.Y., 1964.

[J1] T. L. Jordan, A new parallel algorithm for diagonally dominant tridiagonal matrices, Los Alamos Sci. Lab., 1974.

[J2] M. L. Juncosa and T. W. Mullikin, "On the increase of convergence rates of relaxation procedures for elliptic partial difference equations," J.ACM, vol. 7, 1960, pp. 29-36.

[K1] W. Kahan, Conserving confluence curbs ill-condition, Dept. of Comp. Sci., Univ. of Calif., Berkeley, 1972.

[K2] H. T. Kung, "On computing reciprocals of power series," Numer. Math., vol. 22, 1974, pp. 341-348.

[K3] H. T. Kung and J. F. Traub, All algebraic functions can be computed fast, Dept. of Comp. Sci., Carnegie-Mellon Univ., July 1976.

[L1] J. J. Lambiotte, Jr., The solution of linear systems of equations on a vector computer, Dissertation, Univ. of Virginia, 1975.

[L2] J. J. Lambiotte, Jr. and R. G. Voigt, "The solution of tridiagonal linear systems on the CDC STAR-100 computer," ACM Trans. Math. Software, vol. 1, 1975, pp. 308-329.

[M1] N. Madsen and G. Rodrigue, A comparison of direct methods for tridiagonal systems on the CDC STAR-100, Lawrence Livermore Lab., 1975.

[M2] M. A. Malcolm and J. Palmer, "A fast direct method for solving a class of tridiagonal linear systems," CACM, vol. 17, 1974, pp. 14-17.

[M3] W. Miller, "Software for roundoff analysis," ACM Trans. Math. Software, vol. 1, 1975, pp. 108-128.

[N1] A. K. Noor and S. J. Voigt, "Hypermatrix scheme for finite element systems on CDC STAR-100 computer," Computers and Structures, vol. 5, 1975, pp. 287-296.

[O1] J. M. Ortega and R. G. Voigt, Solution of partial differential equations on vector computers, ICASE, Hampton, Va., 1977.

[O2] A. M. Ostrowski, "Über die Determinanten mit überwiegender Hauptdiagonale," Comment. Math. Helv., vol. 10, 1937, pp. 69-96.

[P1] G. Peters and J. H. Wilkinson, "On the stability of Gauss-Jordan elimination with pivoting," CACM, vol. 18, 1975, pp. 20-24.

[R1] J. K. Reid, ed., Large Sparse Sets of Linear Equations, Academic Press, London, 1970.

[R2] J. K. Reid, "On the method of conjugate gradients for the solution of large sparse systems of linear equations," in [R1, pp. 231-254].

[R3] F. Robert, "Blocs-H-matrices et convergence des méthodes itératives classiques par blocs," Lin. Alg. Appl., vol. 2, 1969, pp. 223-265.

[R4] D. J. Rose and J. R. Bunch, "The role of partitioning in the numerical solution of sparse systems," in [R6, pp. 177-187].

[R5] D. J. Rose and G. F. Whitten, "A recursive analysis of dissection strategies," in [B16, pp. 59-83].

[R6] D. J. Rose and R. A. Willoughby, eds., Sparse Matrices and Their Applications, Plenum Press, N.Y., 1972.

[R7] J. B. Rosser, The direct solution of difference analogs of Poisson's equation, Rept. MRC-TSR-797, Math. Res. Cen., Madison, Wis., 1967.

[S1] J. Schröder and U. Trottenberg, "Reduktionsverfahren für Differenzengleichungen bei Randwertaufgaben I," Numer. Math., vol. 22, 1973, pp. 37-68.

[S1a] J. Schröder, U. Trottenberg and H. Reutersberg, "Reduktionsverfahren für Differenzengleichungen bei Randwertaufgaben II," Numer. Math., vol. 26, 1976, pp. 429-459.

[S2] H. S. Stone, "Parallel tridiagonal equation solvers," ACM Trans. Math. Software, vol. 1, 1975, pp. 289-307.

[S3] H. S. Stone, ed., Introduction to Computer Architecture, Science Research Associates, Palo Alto, Calif., 1975.

[S4] G. Strang and G. J. Fix, An Analysis of the Finite Element Method, Prentice-Hall, Englewood Cliffs, N.J., 1973.

[S5] P. N. Swarztrauber, "The direct solution of the discrete Poisson equation on the surface of a sphere," J. Comp. Phys., vol. 15, 1974, pp. 46-54.

[S6] P. N. Swarztrauber, "A direct method for the discrete solution of separable elliptic equations," SIAM J. Numer. Anal., vol. 11, 1974, pp. 1136-1150.

[S7] P. N. Swarztrauber and R. A. Sweet, "The direct solution of the discrete Poisson equation on a disc," SIAM J. Numer. Anal., vol. 10, 1973, pp. 900-907.

[S8] P. Swarztrauber and R. Sweet, Efficient FORTRAN subprograms for the solution of elliptic partial differential equations, Rept. NCAR-TN/IA-109, Nat. Cen. for Atmospheric Res., Boulder, Col., 1975.

[S9] R. A. Sweet, "Direct methods for the solution of Poisson's equation on a staggered grid," J. Comp. Phys., vol. 12, 1973, pp. 422-428.

[S10] R. A. Sweet, "A generalized cyclic reduction algorithm," SIAM J. Numer. Anal., vol. 11, 1974, pp. 506-520.

[S11] R. A. Sweet, "A cyclic reduction algorithm for block tridiagonal systems of arbitrary dimension," contributed paper, Symp. on Sparse Matrix Computations, Argonne Nat. Lab., Sept. 1975.

[T1] J. F. Traub, ed., Complexity of Sequential and Parallel Numerical Algorithms, Academic Press, N.Y., 1973.

[T2] J. F. Traub, "Iterative solution of tridiagonal systems on parallel or vector computers," in [T1, pp. 49-82].

[T3] J. F. Traub, ed., Analytic Computational Complexity, Academic Press, N.Y., 1976.

[V1] J. M. Varah, "On the solution of block tridiagonal systems arising from certain finite difference equations," Math. Comp., vol. 26, 1972, pp. 859-868.

[V2] J. M. Varah, "A comparison of some numerical methods for two-point boundary value problems," Math. Comp., vol. 28, 1974, pp. 743-755.

[V3] R. S. Varga, Matrix Iterative Analysis, Prentice-Hall, Englewood Cliffs, N.J., 1962.

[V4] R. S. Varga, "On recurring theorems on diagonal dominance," Lin. Alg. Appl., vol. 13, 1976, pp. 1-9.

[W1] W. J. Watson, "The TI ASC, a highly modular and flexible supercomputer architecture," AFIPS Fall 1972, AFIPS Press, Montvale, N.J., vol. 41, pt. 1, pp. 221-229.

[W2] O. B. Widlund, "On the use of fast methods for separable finite difference equations for the solution of general elliptic problems," in [R6, pp. 121-131].

[W3] J. H. Wilkinson, "Error analysis of direct methods of matrix inversion," J.ACM, vol. 8, 1961, pp. 281-330.

[W4] W. A. Wulf and C. G. Bell, "C.mmp, a multi-mini-processor," AFIPS Fall 1972, AFIPS Press, Montvale, N.J., vol. 41, pt. 2, pp. 765-777.

[Y1] D. Y. Y. Yun, "Hensel meets Newton - algebraic constructions in an analytic setting," in [T3, pp. 205-215].
